Reliability is arguably the most important property of a measurement instrument. A ruler is a good example of a reliable instrument: each time it measures the same object, the answer will be approximately the same. Any variation in the measurement, such as a fraction of an inch, is referred to as measurement error. Errors like these are most likely due to a mistake or inconsistency on the part of the individual doing the measuring. A ruler also has internal consistency because its first inch is the same length as all of its other inches (McIntire & Miller, 2007).
Reliability is an essential standard for determining whether data acquired from a psychological test are trustworthy. A reliable test can be trusted to measure people in the same way each time it is used, and a test must also be reliable if it is to be used to compare people. If a test is shown to be reliable, it can be concluded that changes in an individual's scores reflect true changes in that individual. However, a reliable test is not necessarily a valid one (McIntire & Miller, 2007).
There are three common methods that test developers use to check reliability. Each method takes into account the conditions or factors that may produce differences among test scores. With the test-retest method, a test developer gives the same test to the same group of test takers on two separate occasions. The scores from the first and second administrations are then correlated to obtain the reliability coefficient. One danger in using the test-retest method to estimate reliability is that test takers may score higher on the second administration because of practice effects. Practice effects occur when test takers are able to perform the second test more quickly and accurately because of their 'practice' from taking the first test (McIntire & Miller, 2007).
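The test-retest coefficient described above is simply the Pearson correlation between the two administrations. The sketch below illustrates the computation with invented scores (the numbers are not from McIntire & Miller):

```python
# Hypothetical scores for five test takers on two administrations of the same test.
time1 = [82, 90, 75, 88, 79]
time2 = [85, 91, 73, 90, 80]

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A coefficient near 1.0 indicates the test ranked people consistently
# across the two occasions; practice effects would not show up here,
# since a uniform score gain leaves the correlation unchanged.
r_test_retest = pearson(time1, time2)
```

Note that a correlation coefficient is insensitive to a constant shift in scores, which is why practice effects can inflate everyone's second score without lowering the test-retest coefficient.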
To overcome practice effects, as well as differences in test administration and test takers from one occasion to the next, psychologists give two forms of the same test. These forms are made as similar as possible and are given to the same people at the same time, or at least on the same day. The two forms of a test involved in this method are often referred to as parallel or alternate forms. To guard against order effects, which are changes in scores caused by the order in which the tests were administered, half of the test takers complete Form A first while the other half complete Form B first. The greatest danger in using alternate forms, however, is that the forms will not be equivalent (McIntire & Miller, 2007).
If people are able to take the test only once, researchers divide the test in half and correlate the results of the first half with those of the second half. This method is known as split-half reliability, or the split-half method, and uses the Spearman-Brown formula to adjust the half-test correlation coefficient for the full test length. An even more accurate way to measure internal consistency is to compare individuals' scores across every possible way of splitting the test in half. The coefficient alpha formula and the KR-20 allow researchers to correlate the answers to each question on the test with the answers to all of the other questions in order to estimate reliability, which compensates for errors introduced by a lack of equivalence between the two halves. It is also important to note that estimating reliability with internal consistency methods is appropriate only for homogeneous tests that measure a single characteristic or trait (McIntire & Miller, 2007).
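Both adjustments can be written out directly from their standard formulas. The sketch below (with an invented item-score matrix for illustration) shows the Spearman-Brown projection of a half-test correlation and coefficient alpha computed across items:

```python
def spearman_brown(r_half):
    """Project a half-test correlation to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

def coefficient_alpha(item_scores):
    """Coefficient alpha from a matrix: rows = test takers, columns = items."""
    k = len(item_scores[0])  # number of items

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: three test takers, two items answered identically,
# so the items are perfectly consistent with one another.
scores = [[1, 1], [0, 0], [1, 1]]
alpha = coefficient_alpha(scores)   # 1.0 for perfectly consistent items
r_full = spearman_brown(0.6)        # a half-test r of .60 projects to .75
```

Coefficient alpha generalizes the split-half idea: rather than committing to one split, it summarizes the consistency of all items at once, which is why it requires a homogeneous test to be interpretable.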
Reliability in scoring is also extremely important in testing. Any test that requires the scorer to judge the test takers' answers in some way, or to physically observe the test takers' behavior, may contain errors contributed by the scorer. To estimate scorer reliability, researchers have two or more people score the exact same tests and then correlate their scores to determine the consistency of their judgments (McIntire & Miller, 2007).
It is important to note that no measurement instrument is 100% reliable or consistent. To represent this idea, researchers acknowledge that every test score contains two parts, T and E; in other words, a true score plus error. Two kinds of error can exist in test scores: random error, the unexplained difference between the true score and the obtained score, and systematic error, in which a single error source consistently increases or decreases the true score by exactly the same amount (McIntire & Miller, 2007).
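The T-plus-E decomposition can be made concrete with a small simulation. The numbers below are invented for illustration; the point is how the two kinds of error behave over repeated measurements:

```python
import random

random.seed(0)  # fixed seed for a reproducible illustrative run

true_score = 80        # T: the test taker's hypothetical true score
systematic_error = 3   # e.g. a scorer who always awards 3 extra points

# Each observed score = T + systematic error + random error.
# Random error is drawn from a normal distribution centered on zero.
observed = [true_score + systematic_error + random.gauss(0, 2)
            for _ in range(1000)]

mean_observed = sum(observed) / len(observed)
# Over many administrations, random error averages toward zero,
# while systematic error shifts every score by the same amount:
# the mean lands near 83, not near the true score of 80.
```

This is why systematic error threatens validity rather than reliability: the instrument is consistent, but consistently off target.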
McIntire, S. A., & Miller, L. A. (2007). Foundations of psychological testing: A practical approach (2nd ed.). Thousand Oaks, CA: Sage Publications.