Section 3: Reliability and validity
From Research Methods in Psychology
Reliability
Crucial Concept: Reliability is the ability of a measure to give consistent scores.
When we say that a person or a car is reliable, we mean that we can predict to some extent what they are going to do. We know the person will turn up on time, and we know the car will start. Much the same is said of a measure in psychology. If a measure is reliable, we know that it is a good measure, and that it will do what it should do.
Reliability in psychological measurement can have one of two different, but related, meanings. The first meaning of reliability is internal consistency, that is, the extent to which all parts of a measure are measuring the same thing. The second meaning of reliability is stability over time, that is, the extent to which scores on the measure stay the same over time. You must be clear about which one of these definitions you are referring to when you use the word reliability.
Internal Consistency
Crucial Concept: Internal consistency concerns the extent to which different parts of a test are measuring the same thing.
Internal consistency is important for all measures in psychology, but it is particularly important for tests which use a series of items (questions or statements), such as personality, ability or attitude tests.
We need to check the internal consistency of the set of items on the test. There are several ways to check internal consistency; we will look at two here.
Split Half Correlation
One way of checking internal consistency is to calculate a split-half correlation. To do a split-half correlation, we take a test which has a number of items, and we randomly divide the items into two equal-sized sets, as shown in Figure 1. The next stage is to correlate the scores on the two halves of the test. (A correlation is a measure of agreement between two sets of scores.) If the correlation is high, it means that people who score highly on one set of items also score highly on the other set of items, meaning that the scores are internally consistent. (See Chapter 8 for more information on correlations.)
Figure 1: Calculating a split-half correlation.
A test, comprising 10 items, is randomly split into two tests, comprising five items each. The scores on the two tests are then correlated to give a split-half correlation.
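The split-half procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names (`pearson`, `split_half_correlation`), the `seed` parameter and the response data are all made up for the example, and the sketch assumes each person's answers are stored as a list of item scores.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_correlation(responses, seed=None):
    """Randomly split the items into two halves and correlate the half totals.

    responses: one list of item scores per person (hypothetical layout).
    seed: optional seed so that the random split can be reproduced.
    """
    n_items = len(responses[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)          # random allocation of items
    half = n_items // 2
    first, second = items[:half], items[half:]
    totals_a = [sum(person[i] for i in first) for person in responses]
    totals_b = [sum(person[i] for i in second) for person in responses]
    return pearson(totals_a, totals_b)

# Made-up data: six people answering a 10-item test on a 1-5 scale.
responses = [
    [5, 4, 5, 4, 5, 5, 4, 5, 4, 5],
    [2, 1, 2, 2, 1, 2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3, 4, 3, 3, 4, 3],
    [4, 4, 3, 4, 5, 4, 4, 3, 4, 4],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
    [3, 2, 3, 3, 2, 3, 3, 2, 3, 3],
]

print(split_half_correlation(responses, seed=1))
```

Running the function with different values of `seed` produces different splits, and so possibly different correlations.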
Coefficient (Cronbach’s) Alpha
The problem with using a split-half correlation, as you may have realised, is that there are an awful lot of ways that we could split a set of items, and each split may give a different answer. A better approach is to use something called coefficient alpha, or sometimes Cronbach’s alpha, which was developed by Cronbach (1951). Coefficient alpha can be thought of as the average of the split-half correlations across all the possible ways of dividing the test into two sets. Its value will (or should) range from 0 to 1. (There are several different, equally correct, interpretations of Cronbach’s alpha. You may well read about others in other texts.)
Values of 0.7 and above are usually considered adequate values of coefficient alpha (but it can be more complex, depending on what you want to do with the measure; see Nunnally and Bernstein, 1994).
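In practice, alpha is not usually calculated by working through every possible split. A minimal sketch of one commonly used equivalent calculation is given here, based on item variances: alpha = k / (k - 1) × (1 - sum of the item variances / variance of the total scores), where k is the number of items. The function name `cronbach_alpha` and the data layout (one list of item scores per person) are assumptions made for this example, not something taken from the text.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Coefficient alpha from the variance-based formula.

    responses: one list of item scores per person (hypothetical layout).
    """
    k = len(responses[0])                        # number of items
    # Sample variance of each item across people.
    item_vars = [variance([person[i] for person in responses])
                 for i in range(k)]
    # Sample variance of each person's total score.
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Made-up data: five people answering a 4-item scale.
scores = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print(cronbach_alpha(scores))
```

If every person gives the same answer to every item, the items agree perfectly and the function returns 1.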
Stability over Time
The second meaning of reliability is stability over time. When we say a measure is reliable in this sense, we mean that it gives similar scores on different occasions. If we measure people on an attitude test, and we measure them all again 3 months later, unless something dramatic has happened in the meantime, we would expect them all to have broadly similar attitude scores. If their scores had changed, it might suggest to us that the measure was not measuring a stable attitude at all, but instead had been measuring a transient mood, or something else that was easily changed by circumstances.
To find out if a measure is stable over time, you would present a group of participants with a measure, wait some period of time (in journal articles you will be able to find examples of people who have waited 10 minutes, and people who have waited 15 years or more), and then present the same people with the same measure. You then see, using a correlation coefficient (see Chapter 9), the extent to which people’s scores have changed.
You might find two problems. The first problem is that if you only wait a short period, people will be able to remember what they said last time, and will repeat it because they do not want you to think that they are inconsistent. The second problem is that if you wait too long, people might really have changed their attitude. So, if you then find that people have different scores, you don’t know whether this is because your measure is unreliable, or because it is reliably recording a change in attitude. For this reason, two weeks is considered a reasonable amount of time to wait between tests.
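The test-retest check itself is simply a correlation between the scores from the two occasions. The following is a rough sketch, with a hypothetical function name (`retest_correlation`) and made-up attitude scores; each position in the two lists is assumed to belong to the same participant.

```python
from statistics import mean

def retest_correlation(time1, time2):
    """Pearson correlation between scores at two testing occasions.

    time1[i] and time2[i] must come from the same participant.
    """
    m1, m2 = mean(time1), mean(time2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(time1, time2))
    ss1 = sum((a - m1) ** 2 for a in time1)
    ss2 = sum((b - m2) ** 2 for b in time2)
    return cov / (ss1 * ss2) ** 0.5

# Made-up attitude scores for six participants, tested two weeks apart.
first_test = [12, 18, 25, 9, 30, 21]
second_test = [14, 17, 27, 10, 28, 22]

print(retest_correlation(first_test, second_test))
```

Note that a correlation reflects whether people keep their relative positions, not whether the average has shifted: if everyone scored exactly 5 points higher the second time, the correlation would still be 1.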
Validity of Measures
Crucial Concept: The validity of a measure is the extent to which it measures what it is supposed to measure.
By assessing the reliability of a measure, we have ensured that our measure is a good measure. However, we have only ensured that it is a good measure of something. We have not examined whether it is a good measure of what we want it to actually measure. Whether an instrument measures what we want it to measure is called validity.
There are many different aspects of validity, and we will consider three: construct validity, predictive validity and content validity.
Construct Validity
The notion of construct validity was introduced by Cronbach and Meehl (1955). (That's the same Cronbach we saw a minute ago, with alpha.) A construct is built in the minds of psychologists, and does not exist in any concrete way, but it does exist in a theoretical way as an idea. Constructs are things like intelligence, anxiety, depression or thinking time. We know about them, we know what we are talking about when we talk about them, but it is hard to pin them down. To establish the construct validity of a measure we would use our knowledge of the construct, and compare scores on our measure with scores on other measures that are supposed to measure other aspects of the construct. High scorers on intelligence tests should also be high scorers on other tests which measure things related to intelligence, but they should not (necessarily) be high scorers on a test of extraversion.
Predictive Validity
Some texts you may come across distinguish between predictive validity, criterion validity and concurrent validity. However, as Nunnally and Bernstein (1994) point out, “using different terms, however, implies that the logic and procedures of validation are different, which is not true” (p. 94, italics in original). We will therefore consider all of these together.
To establish predictive validity we assess how well our measure relates to some external criterion. This external criterion can be a measure taken at the time, or a measure taken considerably later.
Knowing what to select as the criterion can be a problem: sometimes the criterion is clearly defined, sometimes it is not.
A-levels are used as a criterion to establish whether a student will be allowed to enter university. We assume that good A-level results tell us that the student will do well at university. Are A-level results a valid way of predicting success at university? We can find out by comparing A-level results with university career results. A suitable criterion for the predictive validity of A-levels is therefore university success.
With many other tests, it is not so easy. A university may want to establish the predictive validity of a test which is meant to give some idea of who is a good lecturer, but what is a good lecturer? If we were to ask a range of students, the head of department and the other lecturers, we would get a wide range of different answers.
Content Validity
Content validity requires us to use our knowledge of the content of a domain to assess the validity of a test or measure. A final year degree mark is a measure of a student’s ability and knowledge in a particular subject. If the assessment contained only questions about research methods and statistics, it would not have content validity, because we know that psychology is more than statistics. On the other hand, if your final mark contained no assessment related to research methods and statistics, it would not have content validity, because we know that research methods and statistics is an important component of psychology. We cannot assess content validity without some knowledge of the domain that we are trying to measure – if we didn’t know that psychological knowledge included as a component knowledge of research methods and statistics we would not be able to assess the content validity of your final degree mark.
Section Summary
In this section we have looked at reliability and validity. A measure is reliable if it measures something well. We considered reliability in terms of classical test theory, which says that any measure is a measure of true score, plus some random error. We looked at two different meanings of reliability: internal consistency and stability over time. Internal consistency is assessed using the split-half correlation, or coefficient (Cronbach’s) alpha. Stability over time is assessed using a test-retest correlation.
