Home » Posts tagged 'construct validity'
Tag Archives: construct validity
Construct validity for IQ is fleeting. Some people may refer to Haier’s brain imaging data as evidence for construct validity for IQ, even though there are numerous problems with brain imaging and that neuroreductionist explanations for cognition are “probably not” possible (Uttal, 2014; also see Uttal, 2012). Construct validity refers to how well a test measures what it purports to measure—and this is non-existent for IQ (see Richardson and Norgate, 2014). If the tests did test what they purport to (intelligence), then they would be construct valid. I will show an example of a measure that was validated and shown to be reliable without circular reliance of the instrument itself; I will show that the measures people use in attempt to prove that IQ has construct validity fail; and finally I will provide an argument that the claim “IQ tests test intelligence” is false since the tests are not construct valid.
Jung and Haier (2007) formulated the P-FIT hypothesis—the Parieto-Frontal Intelligence Theory. The theory purports to show how individual differences in test scores are linked to variations in brain structure and function. There are, however, a few problems with the theory (as Richardson and Norgate, 2007 point out in the same issue; pg 162-163). IQ and brain region volumes are experience-dependent (eg Shonkoff et al, 2014; Betancourt et al, 2015; Lipina, 2016; Kim et al, 2019). So since they are experience-dependent, then different experiences will form different brains/test scores. Richardson and Norgate (2007) state that such bigger brain areas are not the cause of IQ, rather that, the cause of IQ is the experience-dependency of both: exposure to middle-class knowledge and skills leads to a better knowledge base for test-taking (Richardson, 2002), whereas access to better nutrition would be found in middle- and upper-classes, which, as Richardson and Norgate (2007) note, lower-quality, more energy-dense foods are more likely to be found in lower classes. Thus, Haier et al did not “find” what they purported too, based on simplistic correlations.
Now let me provide the argument about IQ test experience-dependency:
Premise 1: IQ tests are experience-dependent.
Premise 2: IQ tests are experience-dependent because some classes are more exposed to the knowledge and structure of the test by way of being born into a certain social class.
Premise 3: If IQ tests are experience-dependent because some social classes are more exposed to the knowledge and structure of the test along with whatever else comes with the membership of that social class then the tests test distance from the middle class and its knowledge structure.
Conclusion 1: IQ tests test distance from the middle class and its knowledge structure (P1, P2, P3).
Premise 4: If IQ tests test distance from the middle class and its knowledge structure, then how an individual scores on a test is a function of that individual’s cultural/social distance from the middle class.
Conclusion 2: How an individual scores on a test is a function of that individual’s cultural/social distance from the middle class since the items on the test are more likely to be found in the middle class (i.e., they are experience-dependent) and so, one who is of a lower class will necessarily score lower due to not being exposed to the items on the test (C1, P4)
Conclusion 3: IQ tests test distance from the middle class and its knowledge structure, thus, IQ scores are middle-class scores (C1, C2).
Still further regarding neuroimaging, we need to take a look at William Uttal’s work.
Uttal (2014) shows that “The problem is that both of these approaches are deeply flawed for methodological, conceptual, and empirical reasons. One reason is that simple models composed of a few neurons may simulate behavior but actually be based on completely different neuronal interactions. Therefore, the current best answer to the question asked in the title of this contribution [Are neuroreductionist explanations of cognition possible?] is–probably not.”
Uttal even has a book on meta-analyses and brain imaging—which, of course, has implications for Jung and Haier’s P-FIT theory. In his book Reliability in Cognitive Neuroscience: A Meta-meta Analysis, Uttal (2012: 2) writes:
There is a real possibility, therefore, that we are ascribing much too much meaning to what are possibly random, quasi-random, or irrelevant response patterns. That is, given the many factors that can influence a brain image, it may be that cognitive states and braib image activations are, in actuality, only weakly associated. Other cryptic, uncontrolled intervening factors may account for much, if not all, of the observed findings. Furthermore, differences in the localization patterns observed from one experiment to the next nowadays seems to reflect the inescapable fact that most of the brain is involved in virtually any cognitive process.
Uttal (2012: 86) also warns about individual variability throughout the day, writing:
However, based on these findings, McGonigle and his colleagues emphasized the lack of reliability even within this highly constrained single-subject experimental design. They warned that: “If researchers had access to only a single session from a single subject, erroneous conclusions are a possibility, in that responses to this single session may be claimed to be typical responses for this subject” (p. 708).
The point, of course, is that if individual subjects are different from day to day, what chance will we have of answering the “where” question by pooling the results of a number of subjects?
That such neural activations gleaned from neuroimaging studies vary from individual to individual, and even time of day in regard to individual, means that these differences are not accounted for in such group analyses (meta-analyses). “… the pooling process could lead to grossly distorted interpretations that deviate greatly from the actual biological function of an individual brain. If this conclusion is generally confirmed, the goal of using pooled data to produce some kind of mythical average response to predict the location of activation sites on an individual brain would become less and less achievable“‘ (Uttal, 2012: 88).
Clearly, individual differences in brain imaging are not stable and they change day to day, hour to hour. Since this is the case, how does it make sense to pool (meta-analyze) such data and then point to a few brain images as important for X if there is such large variation in individuals day to day? Neuroimaging data is extremely variable, which I hope no one would deny. So when such studies are meta-analyzed, inter- and intrasubject variation is obscured.
The idea of an average or typical “activation region” is probably nonsensical in light of the neurophysiological and neuroanatomical differences among subjects. Researchers must acknowledge that pooling data obscures what may be meaningful differences among people and their brain mechanisms. THowever, there is an even more negative outcome. That is, by reifying some kinds of “average,” we may be abetting and preserving some false ideas concerning the localization of modular cognitive function (Uttal, 2012: 91).
So when we are dealing with the raw neuroimaging data (i.e., the unprocessed locations of activation peaks), the graphical plots provided of the peaks do not lead to convergence onto a small number of brain areas for that cognitive process.
… inconsistencies abount at all levels of data pooling when one uses brain imaging techniques to search for macroscopic regional correlates of cognitive processes. Individual subjects exhibit a high degree of day-to-day variability. Intersubject comparisons between subjects produce an even greater degree of variability.
The overall pattern of inconsistency and unreliability that is evident in the literature to be reviewed here again suggests that intrinsic variability observed at the subject and experimental level propagates upward into the meta-analysis level and is not relieved by subsequent pooling of additional data or averaging. It does not encourage us to believe that the individual meta-analyses will provide a better answer to the localization of cognitive processes question than does any individual study. Indeed, it now seems plausible that carrying out a meta-analysis actually increases variability of the empirical findings (Uttal, 2012: 132).
So since reliability is low at all levels of neuroimaging analysis, it is very likely that the relations between particular brain regions and specific cognitive processes have not been established and may not even exist. The numerous reports purporting to find such relations report random and quasi-random fluctuations in extremely complex systems.
Construct validity (CV) is “the degree to which a test measures what it claims, or purports, to be measuring.” A “construct” is a theoretical psychological construct. So CV in this instance refers to whether IQ tests test intelligence. We accept that unseen functions measure what they purport to when they’re mechanistically related to differences in two variables. E.g, blood alcohol and consumption level nd the height of the mercury column and blood pressure. These measures are valid because they rely on well-known theoretical constructs. There is no theory for individual intelligence differences (Richardson, 2012). So IQ tests can’t be construct valid.
The accuracy of thermometers was established without circular reliance on the instrument itself. Thermometers measure temperature. IQ tests (supposedly) measure intelligence. There is a difference between these two, though: the reliability of thermometers measuring temperature was established without circular reliance on the thermometer itself (see Chang, 2007).
In regard to IQ tests, it is proposed that the tests are valid since they predict school performance and adult occupation levels, income and wealth. Though, this is circular reasoning and doesn’t establish the claim that IQ tests are valid measures (Richardson, 2017). IQ tests rely on other tests to attempt to prove they are valid. Though, as seen with the valid example of thermometers being validated without circular reliance on the instrument itself, IQ tests are said to be valid by claiming that it predicts test scores and life success. IQ and other similar tests are different versions of the same test, and so, it cannot be said that they are validated on that measure, since they are relating how “well” the test is valid with previous IQ tests, for example, the Stanford-Binet test. This is because “Most other tests have followed the Stanford–Binet in this regard (and, indeed are usually ‘validated’ by their level of agreement with it; Anastasi, 1990)” (Richardson, 2002: 301). How weird… new tests are validated with their agreement with other, non-construct valid tests, which does not, of course, prove the validity of IQ tests.
IQ tests are constructed by excising items that discriminate between better and worse test takers, meaning, of course, that the bell curve is not natural, but forced (see Simon, 1997). Humans make the bell curve, it is not a natural phenomenon re IQ tests, since the first tests produced weird-looking distributions. (Also see Richardson, 2017a, Chapter 2 for more arguments against the bell curve distribution.)
Finally, Richardson and Norgate (2014) write:
In scientific method, generally, we accept external, observable, differences as a valid measure of an unseen function when we can mechanistically relate differences in one to differences in the other (e.g., height of a column of mercury and blood pressure; white cell count and internal infection; erythrocyte sedimentation rate (ESR) and internal levels of inflammation; breath alcohol and level of consumption). Such measures are valid because they rely on detailed, and widely accepted, theoretical models of the functions in question. There is no such theory for cognitive ability nor, therefore, of the true nature of individual differences in cognitive functions.
That “There is no such theory for cognitive ability” is even admitted by lead IQ-ist Ian Deary in his 2001 book Intelligence: A Very Short Introduction, in which he writes “There is no such thing as a theory of human intelligence differences—not in the way that grown-up sciences like physics or chemistry have theories” (Richardson, 2012). Thus, due to this, this is yet another barrier against IQ’s attempted validity, since there is no such thing as a theory of human intelligence.
In sum, neuroimaging meta-analyses (like Jung and Haier, 2007; see also Richardson and Norgate, 2007 in the same issue, pg 162-163) do not show what they purport to show for numerous reasons. (1) There are, of course, consequences of malnutrition for brain development and lower classes are more likely to not have their nutritional needs met (Ruxton and Kirk, 1996); (2) low classes are more likely to be exposed to substance abuse (Karriker-Jaffe, 2013), which may well impact brain regions; (3) “Stress arising from the poor sense of control over circumstances, including financial and workplace insecurity, affects children and leaves “an indelible impression on brain structure and function” (Teicher 2002, p. 68; cf. Austin et al. 2005)” (Richardson and Norgate, 2007: 163); and (4) working-class attitudes are related to poor self-efficacy beliefs, which also affect test performance (Richardson, 2002). So, Jung and Haier’s (2007) theory “merely redescribes the class structure and social history of society and its unfortunate consequences” (Richardson and Norgate, 2007: 163).
In regard to neuroimaging, pooling together (meta-analyzing) numerous studies is fraught with conceptual and methodological problems, since a high-degree of individual variability exists. Thus, attempting to find “average” brain differences in individuals fails, and the meta-analytic technique used (eg by Jung and Haier, 2007) fails to find what they want to find: average brain areas where, supposedly, cognition occurs between individuals. Meta-analyzing such disparate studies does not show an “average” where cognitive processes occur, and thusly, cause differences in IQ test-taking. Reductionist neuroimaging studies do not, as is popularly believed, pinpoint where cognitive processes take place in the brain, they have not been established and they may not even exist.
Nueroreductionism does not work; attempting to reduce cognitive processes to different regions of the brain, even using meta-analytic techniques as discussed here, fail. There “probably cannot” be neuroreductionist explanations for cognition (Uttal, 2014), and so, using these studies to attempt to pinpoint where in the brain—supposedly—cognition occurs for such ancillary things such as IQ test-taking fails. (Neuro)Reductionism fails.
Since there is no theory of individual differences in IQ, then they cannot be construct valid. Even if there were a theory of individual differences, IQ tests would still not be construct valid, since it would need to be established that there is a mechanistic relation between IQ tests and variable X. Attempts at validating IQ tests rely on correlations with other tests and older IQ tests—but that’s what is under contention, IQ validity, and so, correlating with older tests does not give the requisite validity to IQ tests to make the claim “IQ tests test intelligence” true. IQ does not even measure ability for complex cognition; real-life tasks are more complex than the most complex items on any IQ test (Richardson and Norgate, 2014b)
Now, having said all that, the argument can be formulated very simply:
Premise 1: If the claim “IQ tests test intelligence” is true, then IQ tests must be construct valid.
Premise 2: IQ tests are not construct valid.
Conclusion: Therefore, the claim “IQ tests test intelligence” is false. (modus tollens, P1, P2)
The word ‘construct’ is defined as “an idea or theory containing various conceptual elements, typically one considered to be subjective and not based on empirical evidence.” Whereas the word ‘validity’ is defined as “the quality of being logically or factually sound; soundness or cogency.” Is there construct validity for IQ tests? Are IQ tests tested against an idea or theory containing various conceptual elements? No, they are not.
Cronbach and Meehl (1955) define construct validity, which they state is “involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined.”” Though, the construct validity for IQ tests has been fleeting to investigators. Why? Because there is no theory of individual IQ differences to test IQ tests on. It is even stated that “there is no accepted unit of measurement for constructs and even fairly well-known ones, such as IQ, are open to debate.” The ‘fairly well-known ones’ like IQ are ‘open to debate’ because no such validity exists. The only ‘validity’ that exists for IQ tests is correlations with other tests and attempted correlations with job performance, but I will show that that is not construct validity as is classicly defined.
Construct validity can be easily defined as the ability of a test to measure the concept or construct that it is intended to measure. We know two things about IQ tests: 1) they do not test ‘intelligence’ (but they supposedly do a ‘good enough job’ so that it does not matter) and 2) it does not even test the ‘construct’ that it is intended to measure. For example, the math problem ‘1+1’ is construct valid regarding one’s knowledge and application of that math problem. Construct validity can pretty much be summed up as the proof that it is measuring what the test intends…but where is this proof? It is non-existent.
Richardson (1998: 116) writes:
Psychometrists, in the absence of such theoretical description, simply reduce score differences, blindly to the hypothetical construct of ‘natural ability’. The absence of descriptive precision about those constructs has always made validity estimation difficult. Consequently the crucial construct validity is rarely mentioned in test manuals. Instead, test designers have sought other kinds of evidence about the valdity of their tests.
The validity of new tests is sometimes claimed when performances on them correlate with performances on other, previously accepted, and currently used, tests. This is usually called the criterion validity of tests. The Stanford-Binet and the WISC are often used as the ‘standards’ in this respect. Whereas it may be reassuring to know that the new test appears to be measuring the same thing as an old favourite, the assumption here is that (construct) validity has already been demonstrated in the criterion test.
Some may attempt to say that, for instance, biological construct validity for IQ tests may be ‘brain size’, since brain size is correlated with IQ at .4 (meaning 16 percent of the variance in IQ is explained by brain size). However, for this to be true, someone with a larger brain would always have to be ‘more intelligent’ (whatever that means; score higher on an IQ test) than someone with a smaller brain. This is not true, so therefore brain size is not and should not be used as a measure of construct validity. Nisbett et al (2012: 144) address this:
Overall brain size does not plausibly account for differences in aspects of intelligence because all areas of the brain are not equally important for cognitive functioning.
For example, breathalyzer tests are construct valid. There is a .93 correlation (test-retest) between 1 ml/kg bodyweight of ethanol in 20, healthy male subjects. Furthermore, obtaining BAC through gas chromatography of venous blood, the two readings were highly correlated at .94 and .95 (Landauer, 1972). Landauer (1972: 253) writes “the very high accuracy and validity of breath analysis as a correct estimate of the BAL is clearly shown.” Construct validity exists for ad-libitum taste tests of alcohol in the laboratory (Jones et al, 2016).
There is a casual connection between what one breathes into the breathalyzer and his BAC that comes out of the breathalyzer and how much he had to drink. For example, for a male at a bodyweight of 160 pounds, 4 drinks would have him at a BAC of .09, which would make him unfit to drive. (‘One drink’ being 12 oz of beer, 5 oz of wine, or 1.25 oz of 80 proof liquor.) He drinks more, his BAC reading goes up. Someone is more ‘intelligent’ (scores higher on an IQ test), then what? The correlations obtained from so-called ‘more intelligent people’, like glucose consumption, brain evoked potentials, reaction time, nerve conduction velocity, etc have never been shown to determine higher ‘ability’ to score higher on IQ tests. That, too, would not even be construct validation for IQ tests, since there needs to be a measure showing why person A scored higher than person B, which needs to hold one hundred percent of the time.
Another good example of the construct validity of an unseen construct is white blood cell count. White blood cell count was “associated with current smoking status and COPD severity, and a risk factor for poor lung function, and quality of life, especially in non-currently smoking COPD patients. The WBC count can be used, as an easily measurable COPD biomarker” (Koo et al, 2017). In fact, the PRISA II test has white blood cell count in it, which is a construct valid test. Even elevated white blood cell count strongly predicts all-cause and cardiovascular mortality (Johnson et al, 2005). It is also an independent risk factor for coronary artery disease (Twig et al, 2012).
A good example of tests supposedly testing one thing but testing another is found here:
As an example, think about a general knowledge test of basic algebra. If a test is designed to assess knowledge of facts concerning rate, time, distance, and their interrelationship with one another, but test questions are phrased in long and complex reading passages, then perhaps reading skills are inadvertently being measured instead of factual knowledge of basic algebra.
Numerous constructs have validity—but not IQ tests. It is assumed that they test ‘intelligence’ even though an operational definition of intelligence is hard to come by. This is important, as if there cannot be an agreement on what is being tested, how will there be construct validity for said construct in question?
Richardson (2002) writes that Detterman and Sternberg sent out a questionnaire to a group of theorists which was similar to another questionnaire sent out decades earlier to see if there was an agreement on what ‘intelligence’ is. Twenty-five attributes of intelligence were mentioned. Only 3 were mentioned by more than 25 percent of the respondents, with about half mentioning ‘higher level components’, one quarter mentioned ‘executive processes’ while 29 percent mentioned ‘that which is valued by culture’. About one-third of the attributes were mentioned by less than 10 percent of the respondents with 8 percent of them answering that intelligence is ‘the ability to learn’. So if there is hardly any consensus on what IQ tests measure or what ‘intelligence’ is, then construct validity for IQ seems to be very far in the distance, almost unseeable, because we cannot even define the word, nor actually test it with a test that’s not constructed to fit the constructors’ presupposed notions.
Now, explaining the non-existent validity of IQ tests is very simple: IQ tests are purported to measure ‘g’ (whatever that is) and individual differences in test scores supposedly reflect individual differences in ‘g’. However, we cannot say that it is differences in ‘g’ that cause differences in individual test scores since there is no agreed-upon model or description of ‘g’ (Richardson, 2017: 84). Richardson (2017: 84) writes:
In consequence, all claims about the validity of IQ tests have been based on the assumption that other criteria, such as social rank or educational or occupational acheivement, are also, in effect, measures of intelligence. So tests have been constructed to replicate such ranks, as we have seen. Unfortunately, the logic is then reversed to declare that IQ tests must be measures of intelligence, because they predict school acheivement or future occupational level. This is not proper scientific validation so much as a self-fulfilling ordinance.
Construct validity for IQ does not exist (Richardson and Norgate, 2015), unlike construct validity for breathalyzers (Landauer, 1972) or white blood cell count as a disease proxy (Wu et al, 2013; Shah et al, 2017). So, if construct validity is non-existent, then that means that there is no measure for how well IQ tests measure what it’s ‘purported to measure’, i.e., how ‘intelligent’ one is over another because 1) the definition of ‘intelligence’ is ill-defined and 2) IQ tests are not validated against agreed-upon biological models, though some attempts have been made, though the evidence is inconsistent (Richardson and Norgate, 2015). For there to be true validity, evidence cannot be inconsistent; it needs to measure what it purports to measure 100 percent of the time. IQ tests are not calibrated against biological models, but against correlations with other tests that ‘purport’ to measure ‘intelligence’.
(Note: No, I am not saying that everyone is equal in ‘intelligence’ (whatever that is), nor am I stating that everyone has the same exact capacity. As I pointed out last week, just because I point out flaws in tests, it does not mean that I think that people have ‘equal ability’, and my example of an ‘athletic abilities’ test last week is apt to show that pointing out flawed tests does not mean that I deny individual differences in a ‘thing’ (though athletic abilities tests are much better with no assumptions like IQ tests have.))