Race and IQ : Stereotype Threat R.I.P.

In the race and IQ debate, a popular theory proposed as an explanation of the persistent black-white IQ gap (about 1 SD, or 15 IQ points) is the so-called stereotype threat (ST). Briefly, ST is said to create anxiety among individuals who belong to a negatively stereotyped group. Under the threat of confirming a negative stereotype as a self-characterization, the performance of members of that group (e.g., ethnic minorities, women) on IQ tests is supposedly artificially depressed.

A widely cited paper is Steele and Aronson (1995). A fatal flaw that has gone unnoticed by the media is that the authors found no difference between whites and blacks in the “no-threat condition” only because the scores were statistically adjusted for prior SAT scores. As Sackett et al. (2004, p. 9) noted:

On Interpreting Stereotype Threat as Accounting for African American-White Differences on Cognitive Tests - Figure 1c

Figure 1C can be interpreted as follows: “In the sample studied, there are no differences between groups in prior SAT scores, as a result of the statistical adjustment. Creating stereotype threat produces a difference in scores; eliminating threat returns to the baseline condition of no difference.” This casts the work in a very different light: Rather than suggesting stereotype threat as the explanation for SAT differences, it suggests that the threat manipulation creates an effect independent of SAT differences.

Thus, rather than showing that eliminating threat eliminates the large score gap on standardized tests, the research actually shows something very different. Specifically, absent stereotype threat, the African American–White difference is just what one would expect based on the African American–White difference in SAT scores, whereas in the presence of stereotype threat, the difference is larger than would be expected based on the difference in SAT scores.

Suppose an examiner tells the test-takers that the test does not matter. How hard will they try? This is exactly the same problem that arises in Duckworth’s flawed study (2011) on motivation. When test-takers are told that they will earn more money by doing well on the test, they will obviously put in much more effort than they would have otherwise, even though their real abilities remain unchanged. This does not mean that women and ethnic minorities are anxious all the time, or that their performance in school and on the job is constantly depressed over the course of their lifetimes. The score differences arising from stereotype threat experiments are situation-specific and irrelevant to g. This is what we would expect if ST merely drives the level of anxiety. Consider Jensen’s words (1998, pp. 514-515):

In fact, the phenomenon of stereotype threat can be explained in terms of a more general construct, test anxiety, which has been studied since the early days of psychometrics. [111a] Test anxiety tends to lower performance levels on tests in proportion to the degree of complexity and the amount of mental effort they require of the subject. The relatively greater effect of test anxiety in the black samples, who had somewhat lower SAT scores, than the white subjects in the Stanford experiments constitutes an example of the Yerkes-Dodson law. [111b] It describes the empirically observed nonlinear relationship between three variables: (1) anxiety (or drive) level, (2) task (or test) complexity and difficulty, and (3) level of test performance. According to the Yerkes-Dodson law, the maximal test performance occurs at decreasing levels of anxiety as the perceived complexity or difficulty level of the test increases (see Figure 12.14). If, for example, two groups, A and B, have the same level of test anxiety, but group A is higher than group B in the ability measured by the test (so group B finds the test more complex and difficult than does group A), then group B would perform less well than group A. The results of the Stanford studies, therefore, can be explained in terms of the Yerkes-Dodson law, without any need to postulate a racial group difference in susceptibility to stereotype threat or even a difference in the level of test anxiety. The outcome predicted by the Yerkes-Dodson law has been empirically demonstrated in large groups of college students who were either relatively high or relatively low in measured cognitive ability; increased levels of anxiety adversely affected the intelligence test performance of low-ability students (for whom the test was frustratingly difficult) but improved the level of performance of high-ability students (who experienced less difficulty). [111c]

This more general formulation of the stereotype threat hypothesis in terms of the Yerkes-Dodson law suggests other experiments for studying the phenomenon by experimentally manipulating the level of test difficulty and by equating the tests’ difficulty levels for the white and black groups by matching items for percent passing the item within each group. Groups of blacks and whites should also be matched on true-scores derived from g-loaded tests, since equating the groups statistically by means of linear covariance analysis (as was used in the Stanford studies) does not adequately take account of the nonlinear relationship between anxiety and test performance as a function of difficulty level.
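Jensen’s argument can be sketched numerically. The toy model below (my own illustrative functional form and numbers, not Jensen’s) encodes the Yerkes-Dodson law as an inverted-U of performance over anxiety whose peak shifts toward lower anxiety as subjective difficulty rises; two groups with the same anxiety then differ in performance simply because the test feels harder for one of them.

```python
import numpy as np

# Toy Yerkes-Dodson model (assumed form and parameters): performance is an
# inverted-U function of anxiety, and the optimal anxiety level drops as
# subjective task difficulty rises.
def performance(anxiety, difficulty):
    """Gaussian inverted-U; the peak shifts toward lower anxiety on harder tasks."""
    optimum = 1.0 - 0.8 * difficulty  # harder task -> lower optimal anxiety
    return np.exp(-((anxiety - optimum) ** 2) / 0.1)

# Two groups with the SAME anxiety level; the test merely feels harder
# for the lower-ability group (Jensen's groups A and B).
anxiety = 0.7
easy, hard = 0.2, 0.8

print(performance(anxiety, easy) > performance(anxiety, hard))  # True
# Raising anxiety from 0.5 to 0.7 helps on the easy test but hurts on the
# hard one, as in the college-student result Jensen cites.
print(performance(0.7, easy) > performance(0.5, easy))          # True
print(performance(0.7, hard) < performance(0.5, hard))          # True
```

No racial difference in anxiety or in susceptibility to threat is assumed anywhere in the model; the performance difference falls out of the difficulty term alone, which is exactly Jensen’s point.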

What’s more, ST theory tells us nothing about the direction of causality. ST researchers, like Steele and Aronson, simply assume that ST depresses IQ scores among blacks. A question often neglected is how these negative stereotypes got started in the first place. They most likely emerged after decades and decades of poor black performance at school and on the job. Stereotypes in general have a basis in reality; they do not emerge without a chain of causality. There is no magic force behind them. See Jussim et al. (2009).

A pernicious assumption appears in what Steele and Aronson (1995, p. 798) write here:

For African American students, the act of taking a test purported to measure intellectual ability may be enough to induce this threat. But we assume that this is most likely to happen when the test is also frustrating. It is frustration that makes the stereotype – as an allegation of inability – relevant to their performance and thus raises the possibility that they have an inability linked to their race. This is not to argue that the stereotype is necessarily believed; only that, in the face of frustration with the test, it becomes more plausible as a self-characterization and thereby more threatening to the self.

The last sentence is crystal clear. They interpret stereotype threat as an invisible force affecting ethnic minorities even when a member of a stereotyped group is not consciously aware of the threat. In other words, a magic spell, sorcery, witchcraft, evil spirit.

But how can ST theory explain the black-white gap found on cognitive tasks such as digit span or reaction time? On this matter, it is worth recalling Herrnstein and Murray’s discussion (1994, pp. 282-285) of the digit span and reaction time tasks to understand why a pervasive effect of stereotype threat is hard to conceive:

The technical literature is again clear. In study after study of the leading tests, the hypothesis that the B/W difference is caused by questions with cultural content has been contradicted by the facts. [31] Items that the average white test taker finds easy relative to other items, the average black test taker does too; the same is true for items that the average white and black find difficult. … Here, we restrict ourselves to the conclusion: The B/W difference is wider on items that appear to be culturally neutral than on items that appear to be culturally loaded. […]

The first involves the digit span subtest, part of the widely used Wechsler intelligence tests. It has two forms: forward digit span, in which the subject tries to repeat a sequence of numbers in the order read to him, and backward digit span, in which the subject tries to repeat the sequence of numbers backward. The test is simple in concept, uses numbers that are familiar to everyone, and calls on no cultural information besides knowing numbers. The digit span is especially informative regarding test motivation not just because of the low cultural loading of the items but because the backward form is twice as g-loaded as the forward form; it is a much better measure of general intelligence. The reason is that reversing the numbers is mentally more demanding than repeating them in the heard order, as readers can determine for themselves by a little self-testing.

… Several psychometricians, led by Arthur Jensen, have been exploring the underlying nature of g by hypothesizing that neurologic processing speed is implicated, akin to the speed of the microprocessor in a computer. Smarter people process faster than less smart people. The strategy for testing the hypothesis is to give people extremely simple cognitive tasks – so simple that no conscious thought is involved – and to use precise timing methods to determine how fast different people perform these simple tasks. One commonly used apparatus involves a console with a semicircle of eight lights, each with a button next to it. In the middle of the console is the “home” button. At the beginning of each trial, the subject is depressing the home button with his finger. One of the lights in the semicircle goes on. The subject moves his finger to the button closest to the light, which turns it off. There are more complicated versions of the task … but none requires much thought, and everybody gets every trial “right.” The subject’s response speed is broken into two measurements: reaction time (RT), the time it takes the subject to lift his finger from the home button after a target light goes on, and movement time (MT), the time it takes to move the finger from just above the home button to the target button. [36]

… The consistent result of many studies is that white reaction time is faster than black reaction time, but black movement time is faster than white movement time. [39] One can imagine an unmotivated subject who thinks the reaction time test is a waste of time and does not try very hard. But the level of motivation, whatever it may be, seems likely to be the same for the measures of RT and MT. The question arises: How can one be unmotivated to do well during one split-second of a test but apparently motivated during the next split-second?

Suppose our society is so steeped in the conditions that produce test bias that people in disadvantaged groups underscore their cognitive abilities on all the items on tests, thereby hiding the internal evidence of bias. At the same time and for the same reasons, they underperform in school and on the job in relation to their true abilities, thereby hiding the external evidence. In other words, the tests may be biased against disadvantaged groups, but the traces of bias are invisible because the bias permeates all areas of the group’s performance […]

… First, the comments about the digit span and reaction time results apply here as well. How can this uniform background bias suppress black reaction time but not the movement time? How can it suppress performance on backward digit span more than forward digit span? Second, the hypothesis implies that many of the performance yardsticks in the society at large are not only biased, they are all so similar in the degree to which they distort the truth – in every occupation, every type of educational institution, every achievement measure, every performance measure – that no differential distortion is picked up by the data. Is this plausible?

Of course not. It is inconceivable that anxiety would affect some tasks (e.g., reaction time) without affecting others (e.g., movement time). The heterogeneity of its effect casts doubt on the pervasiveness of ST.

Now, to deal more directly with the studies on ST, a meta-analysis by Stoet and Geary (2012, pp. 96-99) shows that the ST effect with regard to women in mathematics is very weak. A finding of particular interest is that previous studies are deeply flawed because the scores were adjusted for preexisting differences in math ability, which creates a confound. Among the 20 studies (see Table 1) that aimed to replicate the original study by Spencer et al. (1999), “Stereotype Threat and Women’s Math Performance”, 11 succeeded in replicating the result, but 8 of them adjusted the scores for previous math performance. Only 3 of the 20 studies replicated the result without adjusting for prior scores.
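The adjustment artifact is easy to reproduce in a toy simulation. All numbers below are hypothetical, chosen only to show the statistical logic: when a real 1 SD ability gap exists and the threat manipulation adds only a small, situation-specific decrement, it is the covariate adjustment, not the removal of threat, that makes the groups look equal at baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: a true 1 SD ability gap between the groups.
ability_a = rng.normal(0.0, 1.0, n)   # non-stereotyped group
ability_b = rng.normal(-1.0, 1.0, n)  # stereotyped group

# Prior SAT and the experimental test both measure that ability with noise.
sat_a = ability_a + rng.normal(0, 0.3, n)
sat_b = ability_b + rng.normal(0, 0.3, n)
score_a = ability_a + rng.normal(0, 0.3, n)
score_b = ability_b + rng.normal(0, 0.3, n)

# The threat manipulation subtracts a small, situation-specific decrement
# from the stereotyped group only.
score_b_threat = score_b - 0.3

def adjusted_gap(y_a, y_b, x_a, x_b):
    """Group gap in residuals after a pooled regression on the covariate."""
    x, y = np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b])
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return resid[: len(y_a)].mean() - resid[len(y_a):].mean()

# Raw scores: the gap never disappears; threat only widens it (~1.0 -> ~1.3).
print(score_a.mean() - score_b.mean())
print(score_a.mean() - score_b_threat.mean())

# SAT-adjusted scores: the baseline gap shrinks to ~0 by construction,
# leaving only the threat decrement (~0.3) -- the pattern misread as
# "removing threat removes the gap".
print(adjusted_gap(score_a, score_b, sat_a, sat_b))
print(adjusted_gap(score_a, score_b_threat, sat_a, sat_b))
```

This is exactly the Figure 1C reading offered by Sackett et al.: the manipulation creates an effect on top of the SAT difference; it does not explain the SAT difference away.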

Can Stereotype Threat Explain the Gender Gap in Mathematics Performance and Achievement - Figure 1

We calculated the model estimates using a random effects model (k = 19) with a restricted likelihood function (Viechtbauer, 2010). We found that for the adjusted data sets, there was a significant effect of stereotype threat on women’s mathematics performance (estimated mean effect size ± 1 SEM; -0.61 ± 0.11, p < .001), but this was not the case for the unadjusted data sets (-0.17 ± 0.10, p = .09). In other words, the moderator variable “adjustment” played a role; the residual heterogeneity after including the moderator variable equals τ² = 0.038 (±0.035), Qresidual (17) = 28.058, p = .04, Qmoderator (2) = 32.479, p < .001 (compared to τ² = 0.075 (±0.047), Q(18) = 43.095, p < .001 without a moderator), which means that 49% of the residual heterogeneity can be explained by including this moderator.

Mischaracterization of the Role of Stereotype Threat in the Gender Gap in Mathematics Performance

The available evidence suggests some women’s performance on mathematics tests can sometimes be negatively influenced by an implicit or explicit questioning of their mathematical competence, but the effect is not as robust as many seem to assume. This is in and of itself not a scientific problem, it simply means that we do not yet fully understand the intrapersonal (e.g., degree of identification with mathematics) and social mechanisms that produce the gender by threat interactions when they are found.

The issue is whether there has been a rush to judgment. It is possible that academic researchers understand that the phenomenon is unstable, with much left to be discovered, and it is reporters and other members of the popular press who have overinterpreted the scientific literature, and as a result mischaracterized the phenomenon as a well established cause of the gender differences in mathematics performance. No doubt this is indeed the case for many researchers but in many other cases there is undue optimism about the stability and generalizability of this phenomenon. An example of this idea can be found in a recent publication by Grand, Ryan, Schmitt, and Hmurovic (2011), which did not replicate the stereotype threat effect on women’s mathematics performance and downplayed the importance of replication as follows: “However, Nguyen and Ryan (2008) emphasized that the important question for future research in addressing this issue is not whether the results of the theory can be replicated consistently, as meta-analytic evidence across multiple comparison groups have clearly demonstrated its robustness (p. 22).” We do not share this enthusiasm, given the just described failures to replicate the original effect unambiguously. […]

We think there are two possible reasons for the misrepresentation of the strength and robustness of the effect. On the one hand, we assume that there has simply been a cascading effect of researchers citing each other, rather than themselves critically reviewing evidence for the stereotype threat hypothesis. For example, if one influential author describes an effect as robust and stable, others might simply accept that as fact.

Another reason for the overenthusiastic support of the stereotype hypothesis is that many of the associated studies only assessed the presumably stigmatized group rather than including a control group. For example, Wraga, Helt, Jacobs, and Sullivan (2008) aimed to dispel the idea that women’s underperformance in mathematics relative to men might be related to biological factors and did so using functional MRI (fMRI). Their study only included female participants and concludes that “By simply altering the context of a spatial reasoning task to create a more positive message, women’s performance accuracy can be increased substantially through greater neural efficiency.” Krendl, Richeson, Kelley, and Heatherton (2008) carried out another fMRI study of stereotype threat in women only and concluded that “The present study sheds light on the specific neural mechanisms affected by stereotype threat, thereby providing considerable promise for developing methods to forestall its unintended consequences.” The results of these studies are interesting and potentially important, but in the absence of a male control group it is difficult, if not logically impossible, to draw conclusions about gender differences in performance. [Note 7: If we would accept that a study merely with female participants would reveal something unique about women, one could make the same argument for any other group category unique to all participants. The mistake of lacking a control group becomes clearer if one would conclude that a study with people who all wear clothes says something unique about people wearing clothes. We can only draw such conclusions by including a control group (i.e., men in studies that aim to draw conclusions about women in comparison to men, or naked people in studies that aim to draw conclusions about people wearing clothes).]
Some of these studies do not explicitly state that their findings tell us something about women (in comparison to men), but the focus on the gender of the female participants when describing the data, and the discussions within these studies about social policy imply that is exactly what the authors mean. At the very least, we cannot expect that the general public would understand this distinction.

It seems likely that the strong conclusions of these latter studies again lead to other studies claiming that there is strong support for women being disadvantaged by social stereotyping. For example, Derks, Inzlicht, and Kang (2008) cite the Krendl et al. (2008) study and amplify the stereotype threat hypothesis as follows: “The incremental benefit of the fMRI work here is in the ability to test a behavioral theory at the biological level of action, the brain. This can serve as a springboard for further theory-testing investigations. The fact that these results converge with the behavioral work of others provides consistency across different levels of analysis and organization, an important step toward the broad understanding of any complex phenomenon (p. 169).” And Stewart, Latu, Kawakami, and Myers (2010) cite the Krendl et al. (2008) article to support the following statement: “In 2005, Harvard President Lawrence Summers publicly suggested that innate gender differences were probably the primary reason for women’s underrepresentation in math and science domains. His remarks caused a stir in academic and nonacademic communities and are at odds with considerable research suggesting that women’s underperformance in math and science is linked to situational factors (p. 221).” The problem here is that although the neuroscientific studies are interesting and important in and of themselves, they do not inform us about whether women are at a disadvantage in comparison to men on the tasks used in these studies. This is critical if we are to fully assess the stereotype threat explanation of the gender gap in mathematics performance.

Finally, we also felt that there were some potential problems with the presentation and interpretation of data. There was often an incomplete description of results (e.g., only figures with no reported Ms or SDs), alpha values were relaxed when it matched the hypothesis, and the analyses of different representations of the same data, such as number correct and percent correct. Granted, it can be reasonable to explore different dependent measures, but it was sometimes the case that significant or marginally significant effects (e.g., percentage correct) were reported in the text and nonsignificant effects (of the same data) in a footnote. Moreover, there was no consistency in the dependent measure of choice, except that the significant one was highlighted in the text and abstract and the nonsignificant one placed in a footnote.

Conclusions and Outlook

We started our review with an overview of research on gender differences in mathematics performance and achievement. Based on the various large surveys on this topic, it seems reasonable to conclude that at least in the higher levels of performance, male mathematical achievers appear to outnumber female mathematical achievers. This is not only reflected in mathematics exams, but also in the number of jobs related to mathematics held by men and, for example, the prestigious Fields Medal for mathematical achievement, which has been won by men only since it was first awarded in 1936. While few researchers will deny that there are gender differences in mathematics achievement, the really interesting question is what factors contribute to these differences, especially given that it will be impossible to close the gender gap without understanding these factors. […]

We also discussed the extent to which existing literature has amplified the stereotype threat hypothesis such that uncritical reading of the literature would lead one to conclude, as many have, that the hypothesis is strongly supported. Given the many enthusiastic statements about the stereotype threat effect, one of the most surprising findings of our review was that there were only 21 studies (including the original) that compared mathematics performance of men and women who were randomly assigned to threat conditions. This seems to be quite a contrast to larger reviews, such as by Nguyen and Ryan (2008). We identified three main reasons for the difference. First, their review was very general, whereas ours focused on the stereotype threat explanation of the gender gap in mathematics performance only. Thus, the many articles that were about other groups that might be affected by stereotype threat were not included. Second, we only included studies that had a male control group. Third, we only included published studies. We believe that this is reasonable, because it is difficult to determine the scientific credibility of unpublished data. Furthermore, we do not think that a possible file drawer effect, which is the likelihood of missing articles that have not been published, would change our conclusion. More likely than not, unpublished studies would have found no differences between experimental conditions, although we can only speculate about this.

The last sentence is worth noting. In fact, this is exactly what Wicherts & de Haan (2009) discussed in an unpublished meta-analysis:

Numerous laboratory experiments have been conducted to show that African Americans’ cognitive test performance suffers under stereotype threat, i.e., the fear of confirming negative stereotypes concerning one’s group. A meta-analysis of 55 published and unpublished studies of this effect shows clear signs of publication bias. The effect varies widely across studies, and is generally small. Although elite university undergraduates may underperform on cognitive tests due to stereotype threat, this effect does not generalize to non-adapted standardized tests, high-stakes settings, and less academically gifted test-takers. Stereotype threat cannot explain the difference in mean cognitive test performance between African Americans and European Americans.

Publication bias means, simply, that when a study fails to find a positive impact of stereotype threat, the study tends not to be published. But why should this scandal be surprising when it comes to the race debate? The suspicions raised by Stoet and Geary were justified.
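The file-drawer mechanism is easy to demonstrate with a small simulation (hypothetical numbers; a sketch of the logic, not Wicherts & de Haan’s actual data): even if the true threat effect is tiny, publishing only the significant results in the predicted direction makes the published literature report a large effect.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = -0.10            # tiny true ST effect in SD units (negative = scores drop)
n_per_group = 20               # typical small lab study
se = np.sqrt(2 / n_per_group)  # approximate standard error of Cohen's d

# 10,000 hypothetical lab studies, each yielding a noisy effect estimate.
estimates = rng.normal(true_effect, se, 10_000)

# File drawer: only significant results in the predicted direction get published.
published = estimates[estimates / se < -1.96]

print(round(estimates.mean(), 2))  # close to the true effect of -0.10
print(published.mean())            # far more negative than the truth
```

Averaging only the published estimates yields an apparent effect several times the true one, which is why Wicherts & de Haan’s inclusion of unpublished studies matters.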

Another line of attack is provided by Gottfredson (2000, p. 139). She says that if stereotype threat represents a significant source of racial differences in IQ scores, then “one should find, among other things, that mental tests generally underestimate Blacks’ later performance in school and work, that test results are sensitive to race and the emotional supportiveness of the tester but not to the mental complexity of the task, and that racial gaps in test scores rise and fall with changes in the racial climate. Accumulated research, however, reveals quite the opposite (e.g., Jensen, 1980, 1998)”. Further evidence is provided by Sackett et al. (2008, pp. 222-223). It was commonly believed that, equated for IQ scores, blacks would outperform whites on, say, scholastic achievement or job performance, thus evidencing underprediction for blacks. The empirical evidence shows otherwise. From Herrnstein & Murray (1994, Appendix 5, pp. 650-654), we read:

If for example, blacks do better in school than whites after choosing blacks and whites with equal test scores, we could say that the test was biased against blacks in academic prediction. Similarly, if they do better on the job after choosing blacks and whites with equal test scores, the test could be considered biased against blacks for predicting work performance. This way of demonstrating bias is tantamount to showing that the regression of outcomes on scores differs for the two groups. On a test biased against blacks, the regression intercept would be higher for blacks than whites, as illustrated in the graphic below. Test scores under these conditions would underestimate, or “underpredict,” the performance outcome of blacks. A randomly selected black and white with the same IQ (shown by the vertical broken line) would not have equal outcomes; the black would outperform the white (as shown by the horizontal broken lines). The test is therefore biased against blacks. On an unbiased test, the two regression lines would converge because they would have the same intercept (the point at which the regression line crosses the vertical axis).

The Bell Curve, 1994, Herrnstein and Murray (graph p. 650)

But the graphic above captures only one of the many possible manifestations of predictive bias. Suppose, for example, a test was less valid for blacks than for whites. [1] In regression terms, this would translate into a smaller coefficient (slope in these graphics), which could, in turn, be associated either with or without a difference in the intercept. The next figure illustrates a few hypothetical possibilities.

The Bell Curve, 1994, Herrnstein and Murray (graph p. 651)

All three black lines have the same low coefficient; they vary only in their intercepts. The gray line, representing whites, has a higher coefficient (therefore, the line is steeper). Begin with the lowest of the three black lines. Only at the very lowest predictor scores do blacks score higher than whites on the outcome measure. As the score on the predictor increases, whites with equivalent predictor scores have higher outcome scores. Here, the test bias is against whites, not blacks. For the intermediate black line, we would pick up evidence for test bias against blacks in the low range of test scores and bias against whites in the high range. The top black line, with the highest of the three intercepts, would accord with bias against blacks throughout the range, but diminishing in magnitude the higher the score.

Readers will quickly grasp that test scores can predict outcomes differently for members of different groups and that such differences may justify claims of test bias. So what are the facts? Do we see anything like the first of the two graphics in the data – a clear difference in intercepts, to the disadvantage of blacks taking the test? Or is the picture cloudier – a mixture of intercept and coefficient differences, yielding one sort of bias or another in different ranges of the test scores? When questions about data come up, cloudier and murkier is usually a safe bet. So let us start with the most relevant conclusion, and one about which there is virtual unanimity among students of the subject of predictive bias in testing: No one has found statistically reliable evidence of predictive bias against blacks, of the sort illustrated in the first graphic, in large, representative samples of blacks and whites, where cognitive ability tests are the predictor variable for educational achievement or job performance. In the notes, we list some of the larger aggregations of data and comprehensive analyses substantiating this conclusion. [2] We have found no modern, empirically based survey of the literature on test bias arguing that tests are predictively biased against blacks, although we have looked for them.

When we turn to the hundreds of smaller studies that have accumulated in the literature, we find examples of varying regression coefficients and intercepts, and predictive validities. This is a fundamental reason for focusing on syntheses of the literature. Smaller or unrepresentative individual studies may occasionally find test bias because of the statistical distortions that plague them. There are, for example, sampling and measurement errors, errors of recording, transcribing, and computing data, restrictions of range in both the predictor and outcome measurements, and predictor or outcome scales that are less valid than they might have been. [3] Given all the distorting sources of variation, lack of agreement across studies is the rule. […]

Insofar as the many individual studies show a pattern at all, it points to overprediction for blacks. More simply, this body of evidence suggests that IQ tests are biased in favor of blacks, not against them. The single most massive set of data bearing on this issue is the national sample of more than 645,000 school children conducted by sociologist James Coleman and his associates for their landmark examination of the American educational system in the mid-1960s. Coleman’s survey included a standardized test of verbal and nonverbal IQ, using the kinds of items that characterize the classic IQ test and are commonly thought to be culturally biased against blacks: picture vocabulary, sentence completion, analogies, and the like. The Coleman survey also included educational achievement measures of reading level and math level that are thought to be straightforward measures of what the student has learned. If IQ items are culturally biased against blacks, it could be predicted that a black student would do better on the achievement measures than the putative IQ measure would lead one to expect (this is the rationale behind the current popularity of steps to modify the SAT so that it focuses less on aptitude and more on measures of what has been learned). But the opposite occurred. Overall, black IQ scores overpredicted black academic achievement by .26 standard deviations. [6] …

A second major source of data suggesting that standardized tests overpredict black performance is the SAT. Colleges commonly compare the performance of freshmen, measured by grade point average, against the expectations of their performance as predicted by SAT scores. A literature review of studies that broke down these data by ethnic group revealed that SAT scores overpredicted freshman grades for blacks in fourteen of fifteen studies, by a median of .20 standard deviation. [7] In five additional studies where the ethnic classification was “minority” rather than specifically “black,” the SAT score overpredicted college performance in all five cases, by a median of .40 standard deviation. [8]

For job performance, the most thorough analysis is provided by the Hartigan Report, assessing the relationship between the General Aptitude Test Battery (GATB) and job performance measures. Out of seventy-two studies that were assembled for review, the white intercept was higher than the black intercept in sixty of them – that is, the GATB overpredicted black performance in sixty out of the seventy-two studies. [9] Of the twenty studies in which the intercepts were statistically significantly different (at the .01 level), the white intercept was greater than the black intercept in all twenty cases. [10]

These findings about overprediction apply to the ordinary outcome measures of academic and job performance. But it should also be noted that “overprediction” can be a misleading concept when it is applied to outcome measures for which the predictor (IQ, in our continuing example) has very low validity. To the extent that blacks and whites differ on average on some outcome that is only weakly linked to the predictor, the predictor will appear biased against whites. Consider the next figure, constructed on the assumption that the predictor is nearly invalid and that the two groups differ on average in their outcome levels.

The Bell Curve, 1994, Herrnstein and Murray (graph p. 654)

This situation is relevant to some of the outcome measures discussed in Chapter 14, such as short-term male unemployment, where the black and white means are quite different, but IQ has little relationship to short-term unemployment for either whites or blacks. This figure was constructed assuming only that there are factors influencing outcomes that are not captured by the predictor, hence its low validity, resulting in the low slope of the parallel regression lines. [11] The intercepts differ, expressing the generally higher level of performance by whites compared to blacks that is unexplained by the predictor variable. If we knew what the missing predictive factors are, we could include them in the predictor, and the intercept difference would vanish – and so would the implication that the newly constituted predictor is biased against whites. What such results seem to be telling us is, first, that IQ tests are not predictively biased against blacks but, second, that IQ tests alone do not explain the observed black-white differences in outcomes. It therefore often looks as if the IQ test is biased against whites.

The absence of underprediction does not imply the absence of measurement bias, however (Wicherts and Millsap, 2009), because prediction tells us nothing about the probability that members of a particular group obtain a given score. Measurement bias is present when two persons with the same latent ability have different probabilities of attaining the same score on the test. Conversely, an observed score is measurement invariant (MI) when it does not depend on the person’s group or ethnicity, given latent ability. On this point, Wicherts et al. (2005) used Multigroup Confirmatory Factor Analysis (MGCFA) to test whether measurement invariance holds under stereotype threat. In short, it does not : measurement invariance is violated in ST experiments. See also StatSquatch (January 6, 2011).

It is interesting to note that they criticized the use of ANCOVA to control for prior test scores (as Steele and Aronson did), on the grounds that “stereotype threat may lower the regression weight of the dependent variable on the covariate in the stereotype threat condition, which violates regression weight homogeneity over all experimental cells” (p. 698). There is no reason to suppose, they say, that the ST effect is homogeneous : “Higher SAT scores would imply higher domain identification and therefore stronger ST effects” (Wicherts, 2005).
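The slope-heterogeneity problem is easy to see in a toy simulation (all values invented, not taken from any of the studies discussed): if stereotype threat flattens the regression of test scores on the SAT covariate, the common-slope assumption behind ANCOVA's adjusted means no longer holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical setup: prior SAT (standardized) predicts the outcome with
# slope 1.0 in the control cell, but the threat cell is assumed to hit
# high scorers hardest, flattening the slope to 0.5 (values invented).
sat = rng.normal(0.0, 1.0, n)
control = 1.0 * sat + rng.normal(0.0, 0.3, n)
threat = 0.5 * sat + rng.normal(0.0, 0.3, n)

def ols_slope(x, y):
    # OLS slope of y on x
    return np.polyfit(x, y, 1)[0]

b_control = ols_slope(sat, control)
b_threat = ols_slope(sat, threat)

# ANCOVA pools both cells into one regression weight; with heterogeneous
# slopes the "adjusted" threat effect depends on where the covariate sits.
print(round(b_control, 2), round(b_threat, 2))
```

The two recovered slopes differ markedly, which is exactly the violation of regression-weight homogeneity that Wicherts describes.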

Wicherts et al. (2005, p. 698) explain the MI model as follows :

We first look at the formal definition of measurement invariance (Mellenbergh, 1989), which is expressed in terms of the conditional distribution of manifest test scores Y [denoted by f(Y | η, v)]. Measurement invariance with respect to v holds if:

f(Y | η, v) = f(Y | η),     (1)

(for all Y, η, v), where η denotes the scores on the latent variable (i.e., latent ability) underlying the manifest random variable Y (i.e., the measured variable), and v is a grouping variable, which defines the nature of groups (e.g., ethnicity, sex). Note that v may also represent groups in experimental cells such as those that differ with respect to the pressures of stereotype threat. Equality 1 holds if, and only if, Y and v are conditionally independent given the scores on the latent construct η (Lubke et al., 2003b; Meredith, 1993).

One important implication of this definition is that the expected value of Y given η and v should equal the expected value of Y given only η. In other words, if measurement invariance holds, then the expected test score of a person with a certain latent ability (i.e., η) is independent of group membership. Thus, if two persons of a different group have exactly the same latent ability, then they must have the same (expected) score on the test. Suppose v denotes sex and Y represents the scores on a test measuring mathematics ability. If measurement invariance holds, then test scores of male and female test takers depend solely on their latent mathematics ability (i.e., η)1 and not on their sex. Then, one can conclude that measurement bias with respect to sex is absent and that manifest test score differences in Y correctly reflect differences in latent ability between the sexes.
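Equality 1 can be checked numerically. Below is a minimal sketch (every parameter value is invented for illustration) with two groups that differ in latent mean: when loading and intercept are shared, examinees with the same η have the same expected score regardless of group, and a group-specific intercept shift breaks exactly that.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Two groups sharing one latent ability eta; group B has a lower latent
# mean. The loading (1.2) and intercepts (5.0, 4.0) are invented values.
eta_a = rng.normal(0.0, 1.0, n)
eta_b = rng.normal(-1.0, 1.0, n)

def observed(eta, intercept):
    # Common-factor measurement model: Y = intercept + loading * eta + error
    return intercept + 1.2 * eta + rng.normal(0.0, 0.5, len(eta))

y_a = observed(eta_a, 5.0)         # reference group
y_b_fair = observed(eta_b, 5.0)    # same intercept: MI holds
y_b_biased = observed(eta_b, 4.0)  # lower intercept: measurement bias

# Condition on (nearly) equal latent ability, eta close to zero:
match_a = np.abs(eta_a) < 0.1
match_b = np.abs(eta_b) < 0.1
gap_fair = y_a[match_a].mean() - y_b_fair[match_b].mean()
gap_biased = y_a[match_a].mean() - y_b_biased[match_b].mean()

# Under MI the conditional gap vanishes; under bias it equals the intercept
# difference even though latent ability has been matched across groups.
print(round(gap_fair, 2), round(gap_biased, 2))
```

In the invariant case the conditional gap is near zero despite the large group difference in latent means; in the biased case two equally able examinees are expected to score about one point apart, which is precisely the situation Equality 1 rules out.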

With regard to the first study analysed by Wicherts et al. (2005, pp. 703-705), the differential effect of ST is displayed as follows :

The measurement bias due to stereotype threat was related to the most difficult NA subtest. An interesting finding is that, because of stereotype threat, the factor loading of this subtest did not deviate significantly from zero. This change in factor loading suggests a non-uniform effect of stereotype threat. This is consistent with the third scenario discussed above (cf. Appendix B) and with the idea that stereotype threat effects are positively associated with latent ability (cf. Cullen et al., 2004). Such a scenario could occur if latent ability and domain identification are positively associated. This differential effect may have led low-ability (i.e., moderately identified) minority students to perform slightly better under stereotype threat (cf. Aronson et al., 1999), perhaps because of moderate arousal levels, whereas the more able (i.e., highly identified) minority students performed worse under stereotype threat. Such a differential effect is displayed graphically in Figure 5.

Stereotype Threat and Group Differences in Test Performance - A Question of Measurement Invariance - Figure 5

In their discussion of the first study (on Dutch minority students) they have examined, Wicherts et al. write : “The intelligence factor explains approximately 0.1% of the variance in the NA subtest, as opposed to 30% in the other groups. To put it differently, because of stereotype threat, the NA test has become completely worthless as a measure of intelligence in the minority group”. In conclusion (p. 711), the authors consider stereotype threat as a source of measurement bias. Therefore, stereotype threat does not affect real (i.e., latent) abilities.

However, constructs such as intelligence and mathematical ability are stable characteristics, and stereotype threat effects are presumably short-lived effects, depending on factors such as test difficulty (e.g., O’Brien & Crandall, 2003; Spencer et al., 1999). Furthermore, stereotype threat effects are often highly task specific. For instance, Seibt and Förster (2004) found that stereotype threat leads to a more cautious and less risky test-taking style (i.e., prevention focus), the effects of which depend on whether a particular task is speeded or not, or whether a task demands creative or analytical thinking (cf. Quinn & Spencer, 2001). In light of such task specificity, we view stereotype threat effects as test artifacts, resulting in measurement bias.

Rushton and Jensen (2005, pp. 249-250; 2010, pp. 16-17) then reviewed several studies contradicting stereotype threat theory. No factor X, that is, an environmental or cultural variable (racism, stereotypes, and the like) that depresses the scores of one ethnic group specifically, has been found.

Another way of answering the question is to compare their psychometric factor structures of kinship patterns, background variables, and subtest correlations. If there are minority-specific developmental processes [i.e., stereotype threat, race stigma, white racism, history of slavery, lowered expectations, heightened stress, etc.] arising from cultural background differences between the races at work, they should be reflected in the correlations between the background variables and the outcome measures. Rowe (1994; Rowe, Vazsonyi, & Flannery, 1994, 1995) examined this hypothesis in a series of studies using structural equation models. One study of six data sources compared cross-sectional correlational matrices (about 10 x 10) for a total of 8,528 Whites, 3,392 Blacks, 1,766 Hispanics, and 906 Asians (Rowe et al., 1994). These matrices contained both independent variables (e.g., home environment, peer characteristics) and developmental outcomes (e.g., achievement, delinquency). A LISREL goodness-of-fit test found each ethnic group’s covariance matrix equal to the matrix of the other groups. Not only were the Black and White matrices nearly identical, but they were as alike as the covariance matrices computed from random halves within either group. There were no distortions in the correlations between the background variables and the outcome measures that suggested any minority-specific developmental factor.

Another study examined longitudinal data on academic achievement (Rowe et al., 1995). Again, any minority-specific cultural processes affecting achievement should have produced different covariance structures among ethnic and racial groups. Correlations were computed between academic achievement and family environment measures in 565 full-sibling pairs from the National Longitudinal Survey of Youth, each tested at ages 6.6 and 9.0 years (White N = 296 pairs; Black N = 149 pairs; Hispanic N = 120 pairs). Each racial group was treated separately, yielding three 8 x 8 correlation matrices, which included age as a variable. Because LISREL analysis showed the matrices were equal across the three groups, there was no evidence of any special minority-specific developmental process affecting either base rates in academic achievement or any changes therein over time.

The series of studies conducted by Rowe (1994, pp. 408-410; 1995, pp. 35-38) actually shows that the hypothesis of a unique causal process, one responsible for lower achievement in one ethnic group but not in the others, does not hold. If such a process existed, the pattern of correlations (1) between environment and achievement, (2) between siblings, and (3) between different ages should be distinct for that particular group (e.g., Africans). The absence of a factor X is a definitive rejection of the assumption underlying ST theory. SEM analyses further show (Rowe & Cleveland, 1996) that the genetic and environmental factors causing variation within groups also cause variation between groups, which is what the default hypothesis predicts. With all factor loadings (genetic as well as shared environment) constrained to equality, the black-white mean differences track the factor loadings, denoting identical etiologies.
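Rowe's test logic can be sketched with toy data (the loadings and group parameters below are invented): if one and the same developmental process generates both groups' scores, a difference in mean level leaves the correlation matrices essentially identical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Four indicators (think: home environment, peers, achievement, delinquency)
# loading on a single common developmental process; loadings are invented.
loadings = np.array([0.7, 0.6, 0.5, 0.4])

def simulate(n, mean_shift):
    factor = rng.normal(mean_shift, 1.0, n)
    noise = rng.normal(0.0, 0.6, (n, 4))
    return factor[:, None] * loadings + noise

group_a = simulate(5000, 0.0)
group_b = simulate(5000, -0.8)  # lower mean level, identical process

r_a = np.corrcoef(group_a, rowvar=False)
r_b = np.corrcoef(group_b, rowvar=False)

# A pure level difference leaves the correlational structure intact, which
# is the pattern Rowe reported for the real Black and White matrices.
max_diff = np.abs(r_a - r_b).max()
print(round(max_diff, 3))
```

The two matrices agree to within sampling error, mirroring Rowe's finding that the Black and White matrices were as alike as matrices computed from random halves of either group. A minority-specific factor X would instead distort some of these correlations in one group only.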

That the usual B-W IQ difference has the same causes within and between groups is further supported by Dolan (2000), Dolan & Hamaker (2001), and Lubke et al. (2003). The B-W difference was not due to subtests differing in difficulty across groups, and the equality of factor loadings also held. These studies show that the B-W difference does not arise from measurement bias, implying that within-group and between-group differences in cognitive ability have the same causes. Here is Lubke’s enlightening discussion :

4. MI implies that between-group differences cannot be due to other factors than those accounting for within-group differences

The statement that between-group differences are attributable to the same sources as within-group differences (or a subset thereof) is another way of saying that mean differences between groups cannot be due to other factors than the individual differences within each group. To confirm this statement, we have to show that two propositions are tenable by the usual statistical criteria: (1) that the same factors are measured in the model for the means as in the model for the covariances and (2) that the same factors are measured across groups.

The first part follows directly from the way the multigroup model has been derived. We have shown that the two parts of the multigroup model, the model for the means and the model for the covariances, have been deduced from the same regression equation (Eq. (1)). Eq. (1) specifies the relation between observed scores and underlying factors. To derive the multigroup model, we have taken the mean of Eq. (1) (as shown in Eq. (5)) and the variances and covariances (see Eq. (6)). Taking means and (co)variances does not change the relation between observed scores and their underlying factors as specified in Eq. (1). The factors in the model for the means are the same as in the model for the covariances because both submodels are derived from the same regression equation of observed variables on the factors.

The second part is implied by the concept of MI. The concept of MI has been developed by Meredith (1993) to provide the necessary and sufficient conditions to determine whether a set of observed items actually measures the same underlying factor(s) in several groups. MI states that the only difference between groups concerns the factor means and the factor covariances but not the relation of observed scores to their underlying factors. Only if the relation of an observed variable to an underlying factor differs across groups, one can argue that a ‘‘different factor’’ is measured in those groups. If Eq. (1) holds across groups with identical parameter values, with the understanding that the mean and the covariances of the factors, η in Eq. (1), may differ, then one can conclude that the proposition that same factors are measured across groups is tenable.

To illustrate our argument, we discuss two scenarios that show why differences in the sources of within- and between-group differences are inconsistent with MI. First, we discuss the case that all factors underlying between-group differences are different from the factors underlying within-group differences. Second, we consider a situation in which the within-group factors are a subset of the between-group factors, that is, the two types of factors coincide but there are additional between-group factors that do not play a role in explaining the within-group differences. In addition, we show that the case, where between-factors are a subset of the within-factors, is consistent with MI and that the modeling approach provides the means to test which of within-group factors does not contribute to the between-group differences.

Suppose observed mean differences between groups are due to entirely different factors than those that account for the individual differences within a group. The notion of ‘‘different factors’’ as opposed to ‘‘same factors’’ implies that the relation of observed variables and underlying factors is different in the model for the means as compared with the model for the covariances, that is, the pattern of factor loadings is different for the two parts of the model. If the loadings were the same, the factors would have the same interpretation. In terms of the multigroup model, different loadings imply that the matrix Λ in Eq. (9) differs from the matrix Λ in Eq. (10) (or Eqs. (5) and (6)). However, this is not the case in the MI model. Mean differences are modeled with the same loadings as the covariances. Hence, this model is inconsistent with a situation in which between-group differences are due to entirely different factors than within-group differences. In practice, the MI model would not be expected to fit because the observed mean differences cannot be reproduced by the product of α and the matrix of loadings, which are used to model the observed covariances. Consider a variation of the widely cited thought experiment provided by Lewontin (1974), in which between-group differences are in fact due to entirely different factors than individual differences within a group. The experiment is set up as follows. Seeds that vary with respect to the genetic make-up responsible for plant growth are randomly divided into two parts. Hence, there are no mean differences with respect to the genetic quality between the two parts, but there are individual differences within each part. One part is then sown in soil of high quality, whereas the other seeds are grown under poor conditions. Differences in growth are measured with variables such as height, weight, etc. Differences between groups in these variables are due to soil quality, while within-group differences are due to differences in genes. If an MI model were fitted to data from such an experiment, it would be very likely rejected for the following reason. Consider between-group differences first. The outcome variables (e.g., height and weight of the plants, etc.) are related in a specific way to the soil quality, which causes the mean differences between the two parts. Say that soil quality is especially important for the height of the plant. In the model, this would correspond to a high factor loading. Now consider the within-group differences. The relation of the same outcome variables to an underlying genetic factor are very likely to be different. For instance, the genetic variation within each of the two parts may be especially pronounced with respect to weight-related genes, causing weight to be the observed variable that is most strongly related to the underlying factor. The point is that a soil quality factor would have different factor loadings than a genetic factor, which means that Eqs. (9) and (10) cannot hold simultaneously. The MI model would be rejected.

In the second scenario, the within-factors are a subset of the between-factors. For instance, a verbal test is taken in two groups from neighborhoods that differ with respect to SES. Suppose further that the observed mean differences are partially due to differences in SES. Within groups, SES does not play a role since each of the groups is homogeneous with respect to SES. Hence, in the model for the covariances, we have only a single factor, which is interpreted in terms of verbal ability. To explain the between-group differences, we would need two factors, verbal ability and SES. This is inconsistent with the MI model because, again, in that model the matrix of factor loadings has to be the same for the mean and the covariance model. This excludes a situation in which loadings are zero in the covariance model and nonzero in the mean model.

As a last example, consider the opposite case where the between-factors are a subset of the within-factors. For instance, an IQ test measuring three factors is administered in two groups and the groups differ only with respect to two of the factors. As mentioned above, this case is consistent with the MI model. The covariances within each group result in a three-factor model. As a consequence of fitting a three-factor model, the vector with factor means, α in Eq. (9), contains three elements. However, only the two elements corresponding to the factors with mean group differences are nonzero. The remaining element is zero. In practice, the hypothesis that an element of α is zero can be investigated by inspecting the associated standard error or by a likelihood ratio test (see below).

In summary, the MI model is a suitable tool to investigate whether within- and between-group differences are due to the same factors. The model is likely to be rejected if the two types of differences are due to entirely different factors or if there are additional factors affecting between-group differences. Testing the hypothesis that only some of the within factors explain all between differences is straightforward. Tenability of the MI model provides evidence that measurement bias is absent and that, consequently, within- and between-group differences are due to factors with the same conceptual interpretation.
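Lewontin's seed experiment, as Lubke et al. recount it, can be put into numbers (a sketch with invented parameter values): genes drive within-group variation mainly through weight, soil drives the between-group gap mainly through height, so the between-group mean vector is not proportional to the within-group loadings and a common-factor MI model would fail.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000

# Random split of seeds: both plots share the same genetic distribution.
genes_good = rng.normal(0.0, 1.0, n)
genes_poor = rng.normal(0.0, 1.0, n)

def grow(genes, soil):
    # Invented effects: genes load mostly on weight, soil mostly on height.
    height = 10.0 + 0.2 * genes + 3.0 * soil + rng.normal(0.0, 0.1, len(genes))
    weight = 5.0 + 1.0 * genes + 0.5 * soil + rng.normal(0.0, 0.1, len(genes))
    return height, weight

h_good, w_good = grow(genes_good, soil=1.0)
h_poor, w_poor = grow(genes_poor, soil=0.0)

# Within-group loadings: covariance of each outcome with the genetic factor.
within = np.array([np.cov(h_good, genes_good)[0, 1],
                   np.cov(w_good, genes_good)[0, 1]])
# Between-group mean differences, driven entirely by soil quality.
between = np.array([h_good.mean() - h_poor.mean(),
                    w_good.mean() - w_poor.mean()])

# MI requires the between vector to be proportional to the within loadings;
# here the ratios diverge wildly, so the MI model would be rejected.
ratios = between / within
print(np.round(within, 2), np.round(between, 2), np.round(ratios, 1))
```

The soil factor and the genetic factor imply incompatible loading patterns, which is Lubke's point: when within- and between-group differences have different sources, the single matrix of loadings in Eqs. (9) and (10) cannot fit both.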

To the extent that the black-white IQ difference does not violate MI, whereas ST does violate MI, we are left with the conclusion that race differences in IQ arise from common factors, that is, have the same causes. Ad hoc theories are not needed. To quote Gottfredson (2004, p. 37) : “According to social privilege theory, there would be no racial inequality in a fair, non-discriminatory society. The continuing existence of racial inequality is therefore proof of continuing discrimination.” On that view, the gap itself counts as proof of discrimination. Egalitarian ideology is not needed to explain the data.

References :

  1. Dolan Conor. V., 2000, Investigating Spearman’s hypothesis by means of multi-group confirmatory factor analysis.
  2. Dolan Conor V., and Hamaker Ellen L., 2001, Investigating black–white differences in psychometric IQ: Multi-group confirmatory factor analysis of the WISC-R and K-ABC and a critique of the method of correlated vectors.
  3. Duckworth Angela Lee, Quinn Patrick D., Lynam Donald R., Loeber Rolf, and Stouthamer-Loeber Magda, 2011, Role of test motivation in intelligence testing.
  4. Gottfredson Linda S., 2000, Skills Gaps, Not Tests, Make Racial Proportionality Impossible.
  5. Gottfredson Linda S., 2004, Social Consequences of Group Differences in Cognitive Ability.
  6. Grand James A., Ryan Ann Marie, Schmitt Neal, and Hmurovic Jillian, 2011, How Far Does Stereotype Threat Reach? The Potential Detriment of Face Validity in Cognitive Ability Testing.
  7. Herrnstein Richard J., and Murray Charles, 1994, The Bell Curve: Intelligence and Class Structure in American Life, With a New Afterword by Charles Murray.
  8. Jensen Arthur R., 1998, The g Factor: The Science of Mental Ability.
  9. Jussim Lee, Cain Thomas R., Crawford Jarret T., Harber Kent, Cohen Florette, 2009, The Unbearable Accuracy of Stereotypes.
  10. Lubke Gitta H., Dolan Conor V., Kelderman Henk, Mellenbergh Gideon J., 2003, On the relationship between sources of within- and between-group differences and measurement invariance in the common factor model.
  11. Nguyen Hannah-Hanh D., and Ryan Ann Marie, 2008, Does Stereotype Threat Affect Test Performance of Minorities and Women? A Meta-Analysis of Experimental Evidence.
  12. Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1994, No More Than Skin Deep: Ethnic and Racial Similarity in Developmental Process.
  13. Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1995, Ethnic and Racial Similarity in Developmental Process: A Study of Academic Achievement.
  14. Rowe David C., and Cleveland Hobard H., 1996, Academic Achievement in Blacks and Whites: Are the Developmental Processes Similar?.
  15. Rushton J. Philippe, and Jensen Arthur R., 2005, Thirty Years of Research on Race Differences in Cognitive Ability.
  16. Rushton J. Philippe, and Jensen Arthur R., 2010, Race and IQ: A Theory-Based Review of the Research in Richard Nisbett’s Intelligence and How to Get It.
  17. Sackett Paul R., Hardison Chaitra M., and Cullen Michael J., 2004, On Interpreting Stereotype Threat as Accounting for African American-White Differences on Cognitive Tests.
  18. Sackett Paul R., Borneman Matthew J., and Connelly Brian S., 2008, High-Stakes Testing in Higher Education and Employment.
  19. Steele Claude M., and Aronson Joshua, 1995, Stereotype Threat and the Intellectual Test Performance of African Americans.
  20. Steele Claude M., and Aronson Joshua, 1998, Stereotype Threat and the Test Performance of Academically Successful African Americans, pp. 401-426, in The Black-White Test Score Gap, by Jencks Christopher and Phillips Meredith (1998).
  21. Stoet Gijsbert, and Geary David C., 2012, Can Stereotype Threat Explain the Gender Gap in Mathematics Performance and Achievement?.
  22. Wicherts Jelte M., 2005, Stereotype Threat Research and the Assumptions Underlying Analysis of Covariance.
  23. Wicherts Jelte M., Dolan Conor V., and Hessen David J., 2005, Stereotype Threat and Group Differences in Test Performance: A Question of Measurement Invariance.
  24. Wicherts Jelte M., and Millsap Roger E., 2009, The Absence of Underprediction Does Not Imply the Absence of Measurement Bias.
  25. Wicherts Jelte M., and de Haan Cor, 2009, unpublished, Stereotype threat and the cognitive test performance of African Americans.

3 comments on “Race and IQ : Stereotype Threat R.I.P.”

  1. dianabuja says:

    Interesting blog – I’ve printed it out and will go off to read it more carefully. Working in Africa, the issue of race, beginning in colonial times, has so profoundly impacted life locally and nationally, and is often confounded with ‘tribe’.
    On test anxiety – yes, a real issue; I suffered terribly in grad school in both written and oral exams.

  2. 猛虎 says:

    Stereotype threat and test anxiety are certainly real phenomena, but I would say they are situation-specific, a variable within-group factor. They have no impact on latent ability, just the “observed” scores.

    • dianabuja says:

      Agreed. Certainly, once I sorted out my committee (with the help of my advisor), the final PhD orals were a breeze (so to speak). Not latent ability, but easily confused for it…
