When it comes to the IQ-and-race debate, a sadly popular theory offered as an explanation for the persistent black-white IQ gap (about 1 SD, or one standard deviation, i.e., 15 IQ points) is Stereotype Threat (ST) theory. Briefly, ST creates anxiety among individuals who belong to the negatively stereotyped, stigmatized group. Supposedly, by conforming to a negative stereotype, the affected groups (e.g., women, ethnic minorities) see their IQ test performance artificially depressed.
A widely cited paper is Steele & Aronson (1995). A fatal error that went completely unnoticed by the media is that the authors found no black-white difference in the no-threat condition for the simple reason that prior SAT scores had been adjusted for, which biased the result. As Sackett et al. (2004, p. 9) noted:
Figure 1C can be interpreted as follows: “In the sample studied, there are no differences between groups in prior SAT scores, as a result of the statistical adjustment. Creating stereotype threat produces a difference in scores; eliminating threat returns to the baseline condition of no difference.” This casts the work in a very different light: Rather than suggesting stereotype threat as the explanation for SAT differences, it suggests that the threat manipulation creates an effect independent of SAT differences.
Thus, rather than showing that eliminating threat eliminates the large score gap on standardized tests, the research actually shows something very different. Specifically, absent stereotype threat, the African American–White difference is just what one would expect based on the African American–White difference in SAT scores, whereas in the presence of stereotype threat, the difference is larger than would be expected based on the difference in SAT scores.
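The artifact Sackett et al. describe can be made concrete with a minimal simulation (entirely hypothetical numbers, not Steele & Aronson's data): once prior SAT scores are used as a covariate, a pre-existing group gap disappears from the no-threat baseline by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # large samples so the illustration is stable

# Hypothetical setup: the two samples differ by ~1 SD in prior SAT
sat_w = rng.normal(0.0, 1.0, n)
sat_b = rng.normal(-1.0, 1.0, n)

# No-threat test performance simply tracks prior SAT plus noise
test_w = 0.8 * sat_w + rng.normal(0.0, 0.6, n)
test_b = 0.8 * sat_b + rng.normal(0.0, 0.6, n)

raw_gap = test_w.mean() - test_b.mean()  # ~0.8 SD before adjustment

# Covariance adjustment on prior SAT: remove from each score the part
# predicted by SAT. The baseline gap then vanishes by construction.
sat = np.concatenate([sat_w, sat_b])
test = np.concatenate([test_w, test_b])
slope, intercept = np.polyfit(sat, test, 1)
adj_w = test_w - (slope * sat_w + intercept)
adj_b = test_b - (slope * sat_b + intercept)
adjusted_gap = adj_w.mean() - adj_b.mean()  # ~0

print(f"raw gap: {raw_gap:.2f} SD, adjusted gap: {adjusted_gap:.2f} SD")
```

The "no difference in the no-threat condition" is thus a property of the adjustment, not a discovery about the groups.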
Suppose an examiner tells the subjects that the test does not matter at all. Will they put much effort into it? This is the same problem posed by Duckworth's (2011) flawed study of motivation. If subjects are told that they will earn more money by scoring well on an IQ test, they will put far more effort into it than they otherwise would have, even though their actual level of IQ remains unchanged. This does not mean that women and ethnic minorities are anxious all the time, or that their performance at school and at work will be depressed throughout their lives. The score differences derived from stereotype threat experiments are specific to a particular situation. They are entirely unrelated to g. This is exactly what we should expect if ST experiments do nothing more than vary the level of anxiety. Consider here the words of Jensen (1998, pp. 514-515):
In fact, the phenomenon of stereotype threat can be explained in terms of a more general construct, test anxiety, which has been studied since the early days of psychometrics. [111a] Test anxiety tends to lower performance levels on tests in proportion to the degree of complexity and the amount of mental effort they require of the subject. The relatively greater effect of test anxiety in the black samples, who had somewhat lower SAT scores, than the white subjects in the Stanford experiments constitutes an example of the Yerkes-Dodson law. [111b] It describes the empirically observed nonlinear relationship between three variables: (1) anxiety (or drive) level, (2) task (or test) complexity and difficulty, and (3) level of test performance. According to the Yerkes-Dodson law, the maximal test performance occurs at decreasing levels of anxiety as the perceived complexity or difficulty level of the test increases (see Figure 12.14). If, for example, two groups, A and B, have the same level of test anxiety, but group A is higher than group B in the ability measured by the test (so group B finds the test more complex and difficult than does group A), then group B would perform less well than group A. The results of the Stanford studies, therefore, can be explained in terms of the Yerkes-Dodson law, without any need to postulate a racial group difference in susceptibility to stereotype threat or even a difference in the level of test anxiety. The outcome predicted by the Yerkes-Dodson law has been empirically demonstrated in large groups of college students who were either relatively high or relatively low in measured cognitive ability; increased levels of anxiety adversely affected the intelligence test performance of low-ability students (for whom the test was frustratingly difficult) but improved the level of performance of high-ability students (who experienced less difficulty). [111c]
This more general formulation of the stereotype threat hypothesis in terms of the Yerkes-Dodson law suggests other experiments for studying the phenomenon by experimentally manipulating the level of test difficulty and by equating the tests’ difficulty levels for the white and black groups by matching items for percent passing the item within each group. Groups of blacks and whites should also be matched on true-scores derived from g-loaded tests, since equating the groups statistically by means of linear covariance analysis (as was used in the Stanford studies) does not adequately take account of the nonlinear relationship between anxiety and test performance as a function of difficulty level.
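The inverted-U relationship Jensen invokes can be sketched with a toy functional form (the shape and constants below are purely illustrative assumptions, not an empirical model): the anxiety level that maximizes performance falls as task difficulty rises, so the same increase in anxiety can help on an easy task while hurting on a hard one.

```python
import math

def performance(anxiety, difficulty):
    """Toy Yerkes-Dodson curve (illustrative only): performance peaks
    at an optimal anxiety level that falls as difficulty rises."""
    optimum = 1.0 - 0.8 * difficulty  # harder task -> lower optimal arousal
    return math.exp(-(anxiety - optimum) ** 2)

# Easy task (difficulty 0.2): raising anxiety from 0.2 to 0.8 helps
easy_calm, easy_anxious = performance(0.2, 0.2), performance(0.8, 0.2)

# Hard task (difficulty 0.9): the same increase in anxiety hurts
hard_calm, hard_anxious = performance(0.2, 0.9), performance(0.8, 0.9)

print(easy_anxious > easy_calm, hard_anxious < hard_calm)  # True True
```

This is the qualitative pattern Jensen cites: the same anxiety manipulation improves high-ability students' performance (for whom the test is easy) while depressing low-ability students' performance, without any group difference in susceptibility to threat.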
Moreover, stereotype threat (ST) theory tells us nothing about the direction of causality. Stereotype threat researchers, such as Steele & Aronson, simply assume that the ST phenomenon depresses black IQ. An often neglected question is how the stereotypes arose in the first place. These stereotypes emerged, rather, after decades upon decades of observing blacks' poor performance both at school and at work. Stereotypes in general have a basic logic to them, in that they do not emerge without a causal chain. There is no magical force behind them. See Jussim et al. (2009).
A pernicious assumption is what Steele & Aronson (1995, p. 798) state here:
For African American students, the act of taking a test purported to measure intellectual ability may be enough to induce this threat. But we assume that this is most likely to happen when the test is also frustrating. It is frustration that makes the stereotype – as an allegation of inability – relevant to their performance and thus raises the possibility that they have an inability linked to their race. This is not to argue that the stereotype is necessarily believed; only that, in the face of frustration with the test, it becomes more plausible as a self-characterization and thereby more threatening to the self.
The last sentence is crystal clear. They interpret stereotype threat as an invisible force affecting ethnic minorities even when a member of the stereotyped group is not even aware of it. In other words: a magic spell, witchcraft, an evil spirit.
But how does ST theory explain the black-white IQ difference found on intelligence tests such as digit span or reaction time tasks? On this question, it is worth recalling Herrnstein & Murray's (1994, pp. 282-285) discussion of digit span and reaction time tasks in order to understand why an omnipresent stereotype threat effect is hard to conceive:
The technical literature is again clear. In study after study of the leading tests, the hypothesis that the B/W difference is caused by questions with cultural content has been contradicted by the facts.  Items that the average white test taker finds easy relative to other items, the average black test taker does too; the same is true for items that the average white and black find difficult. … Here, we restrict ourselves to the conclusion: The B/W difference is wider on items that appear to be culturally neutral than on items that appear to be culturally loaded. […]
The first involves the digit span subtest, part of the widely used Wechsler intelligence tests. It has two forms: forward digit span, in which the subject tries to repeat a sequence of numbers in the order read to him, and backward digit span, in which the subject tries to repeat the sequence of numbers backward. The test is simple in concept, uses numbers that are familiar to everyone, and calls on no cultural information besides knowing numbers. The digit span is especially informative regarding test motivation not just because of the low cultural loading of the items but because the backward form is twice as g-loaded as the forward form; that is, it is a much better measure of general intelligence. The reason is that reversing the numbers is mentally more demanding than repeating them in the heard order, as readers can determine for themselves by a little self-testing.
… Several psychometricians, led by Arthur Jensen, have been exploring the underlying nature of g by hypothesizing that neurologic processing speed is implicated, akin to the speed of the microprocessor in a computer. Smarter people process faster than less smart people. The strategy for testing the hypothesis is to give people extremely simple cognitive tasks – so simple that no conscious thought is involved – and to use precise timing methods to determine how fast different people perform these simple tasks. One commonly used apparatus involves a console with a semicircle of eight lights, each with a button next to it. In the middle of the console is the “home” button. At the beginning of each trial, the subject is depressing the home button with his finger. One of the lights in the semicircle goes on. The subject moves his finger to the button closest to the light, which turns it off. There are more complicated versions of the task … but none requires much thought, and everybody gets every trial “right.” The subject’s response speed is broken into two measurements: reaction time (RT), the time it takes the subject to lift his finger from the home button after a target light goes on, and movement time (MT), the time it takes to move the finger from just above the home button to the target button. 
… The consistent result of many studies is that white reaction time is faster than black reaction time, but black movement time is faster than white movement time.  One can imagine an unmotivated subject who thinks the reaction time test is a waste of time and does not try very hard. But the level of motivation, whatever it may be, seems likely to be the same for the measures of RT and MT. The question arises: How can one be unmotivated to do well during one split-second of a test but apparently motivated during the next split-second?
Suppose our society is so steeped in the conditions that produce test bias that people in disadvantaged groups underscore their cognitive abilities on all the items on tests, thereby hiding the internal evidence of bias. At the same time and for the same reasons, they underperform in school and on the job in relation to their true abilities, thereby hiding the external evidence. In other words, the tests may be biased against disadvantaged groups, but the traces of bias are invisible because the bias permeates all areas of the group’s performance […]
… First, the comments about the digit span and reaction time results apply here as well. How can this uniform background bias suppress black reaction time but not the movement time? How can it suppress performance on backward digit span more than forward digit span? Second, the hypothesis implies that many of the performance yardsticks in the society at large are not only biased, they are all so similar in the degree to which they distort the truth – in every occupation, every type of educational institution, every achievement measure, every performance measure – that no differential distortion is picked up by the data. Is this plausible?
Of course not. It is plainly inconceivable that anxiety would affect some tasks (e.g., reaction time) without affecting others (e.g., movement time). The heterogeneity of its effects casts doubt on the omnipresence of stereotype threat.
Now, to answer stereotype threat theory more directly, a meta-analysis by Stoet & Geary (2012, pp. 96-99) shows that the effect of ST on women's mathematics performance is very weak. A particularly interesting finding is that the earlier studies were fundamentally flawed because pre-existing differences in mathematics scores, the variable of interest, had been adjusted for, which introduces confounding. Of the 20 studies (see Table 1) attempting to replicate the original study by Spencer et al. (1999), “Stereotype Threat and Women’s Math Performance”, 11 succeeded in replicating the result, but in 8 of them prior math scores had been adjusted. Only 3 of the 20 studies replicated the result without adjusting prior scores.
We calculated the model estimates using a random effects model (k = 19) with a restricted likelihood function (Viechtbauer, 2010). We found that for the adjusted data sets, there was a significant effect of stereotype threat on women’s mathematics performance (estimated mean effect size ± 1 SEM; -0.61 ± 0.11, p < .001), but this was not the case for the unadjusted data sets (-0.17 ± 0.10, p = .09). In other words, the moderator variable “adjustment” played a role; the residual heterogeneity after including the moderator variable equals τ² = 0.038 (±0.035), Qresidual (17) = 28.058, p = .04, Qmoderator (2) = 32.479, p < .001 (compared to τ² = 0.075 (±0.047), Q(18) = 43.095, p < .001 without a moderator), which means that 49% of the residual heterogeneity can be explained by including this moderator.
Mischaracterization of the Role of Stereotype Threat in the Gender Gap in Mathematics Performance
The available evidence suggests some women’s performance on mathematics tests can sometimes be negatively influenced by an implicit or explicit questioning of their mathematical competence, but the effect is not as robust as many seem to assume. This is in and of itself not a scientific problem, it simply means that we do not yet fully understand the intrapersonal (e.g., degree of identification with mathematics) and social mechanisms that produce the gender by threat interactions when they are found.
The issue is whether there has been a rush to judgment. It is possible that academic researchers understand that the phenomenon is unstable, with much left to be discovered, and it is reporters and other members of the popular press who have overinterpreted the scientific literature, and as a result mischaracterized the phenomenon as a well established cause of the gender differences in mathematics performance. No doubt this is indeed the case for many researchers but in many other cases there is undue optimism about the stability and generalizability of this phenomenon. An example of this idea can be found in a recent publication by Grand, Ryan, Schmitt, and Hmurovic (2011), which did not replicate the stereotype threat effect on women’s mathematics performance and downplayed the importance of replication as follows: “However, Nguyen and Ryan (2008) emphasized that the important question for future research in addressing this issue is not whether the results of the theory can be replicated consistently, as meta-analytic evidence across multiple comparison groups have clearly demonstrated its robustness (p. 22).” We do not share this enthusiasm, given the just described failures to replicate the original effect unambiguously. […]
We think there are two possible reasons for the misrepresentation of the strength and robustness of the effect. On the one hand, we assume that there has simply been a cascading effect of researchers citing each other, rather than themselves critically reviewing evidence for the stereotype threat hypothesis. For example, if one influential author describes an effect as robust and stable, others might simply accept that as fact.
Another reason for the overenthusiastic support of the stereotype hypothesis is that many of the associated studies only assessed the presumably stigmatized group rather than including a control group. For example, Wraga, Helt, Jacobs, and Sullivan (2008) aimed to dispel the idea that women’s underperformance in mathematics relative to men might be related to biological factors and did so using functional MRI (fMRI). Their study only included female participants and concludes that “By simply altering the context of a spatial reasoning task to create a more positive message, women’s performance accuracy can be increased substantially through greater neural efficiency.” Krendl, Richeson, Kelley, and Heatherton (2008) carried out another fMRI study of stereotype threat in women only and concluded that “The present study sheds light on the specific neural mechanisms affected by stereotype threat, thereby providing considerable promise for developing methods to forestall its unintended consequences.” The results of these studies are interesting and potentially important, but in the absence of a male control group it is difficult, if not logically impossible, to draw conclusions about gender differences in performance. [Note 7: If we would accept that a study merely with female participants would reveal something unique about women, one could make the same argument for any other group category unique to all participants. The mistake of lacking a control group becomes clearer if one would conclude that a study with people who all wear clothes says something unique about people wearing clothes. We can only draw such conclusions by including a control group (i.e., men in studies that aim to draw conclusions about women in comparison to men, or naked people in studies that aim to draw conclusions about people wearing clothes).]
Some of these studies do not explicitly state that their findings tell us something about women (in comparison to men), but the focus on the gender of the female participants when describing the data, and the discussions within these studies about social policy imply that is exactly what the authors mean. At the very least, we cannot expect that the general public would understand this distinction.
It seems likely that the strong conclusions of these latter studies again lead to other studies claiming that there is strong support for women being disadvantaged by social stereotyping. For example, Derks, Inzlicht, and Kang (2008) cite the Krendl et al. (2008) study and amplify the stereotype threat hypothesis as follows: “The incremental benefit of the fMRI work here is in the ability to test a behavioral theory at the biological level of action, the brain. This can serve as a springboard for further theory-testing investigations. The fact that these results converge with the behavioral work of others provides consistency across different levels of analysis and organization, an important step toward the broad understanding of any complex phenomenon (p. 169).” And Stewart, Latu, Kawakami, and Myers (2010) cite the Krendl et al. (2008) article to support the following statement: “In 2005, Harvard President Lawrence Summers publicly suggested that innate gender differences were probably the primary reason for women’s underrepresentation in math and science domains. His remarks caused a stir in academic and nonacademic communities and are at odds with considerable research suggesting that women’s underperformance in math and science is linked to situational factors (p. 221).” The problem here is that although the neuroscientific studies are interesting and important in and of themselves, they do not inform us about whether women are at a disadvantage in comparison to men on the tasks used in these studies. This is critical if we are to fully assess the stereotype threat explanation of the gender gap in mathematics performance.
Finally, we also felt that there were some potential problems with the presentation and interpretation of data. There was often an incomplete description of results (e.g., only figures with no reported Ms or SDs), alpha values were relaxed when it matched the hypothesis, and the analyses of different representations of the same data, such as number correct and percent correct. Granted, it can be reasonable to explore different dependent measures, but it was sometimes the case that significant or marginally significant effects (e.g., percentage correct) were reported in the text and nonsignificant effects (of the same data) in a footnote. Moreover, there was no consistency in the dependent measure of choice, except that the significant one was highlighted in the text and abstract and the nonsignificant one placed in a footnote.
Conclusions and Outlook
We started our review with an overview of research on gender differences in mathematics performance and achievement. Based on the various large surveys on this topic, it seems reasonable to conclude that at least in the higher levels of performance, male mathematical achievers appear to outnumber female mathematical achievers. This is not only reflected in mathematics exams, but also in the number of jobs related to mathematics held by men and, for example, the prestigious Fields Medal for mathematical achievement, which has been won by men only since it was first awarded in 1936. While few researchers will deny that there are gender differences in mathematics achievement, the really interesting question is what factors contribute to these differences, especially given that it will be impossible to close the gender gap without understanding these factors. […]
We also discussed the extent to which existing literature has amplified the stereotype threat hypothesis such that uncritical reading of the literature would lead one to conclude, as many have, that the hypothesis is strongly supported. Given the many enthusiastic statements about the stereotype threat effect, one of the most surprising findings of our review was that there were only 21 studies (including the original) that compared mathematics performance of men and women who were randomly assigned to threat conditions. This seems to be quite a contrast to larger reviews, such as by Nguyen and Ryan (2008). We identified three main reasons for the difference. First, their review was very general, whereas ours focused on the stereotype threat explanation of the gender gap in mathematics performance only. Thus, the many articles that were about other groups that might be affected by stereotype threat were not included. Second, we only included studies that had a male control group. Third, we only included published studies. We believe that this is reasonable, because it is difficult to determine the scientific credibility of unpublished data. Furthermore, we do not think that a possible file drawer effect, which is the likelihood of missing articles that have not been published, would change our conclusion. More likely than not, unpublished studies would have found no differences between experimental conditions, although we can only speculate about this.
The last sentence is interesting. In fact, it is exactly what Wicherts & de Haan (2009) acknowledged in an unpublished meta-analysis:
Numerous laboratory experiments have been conducted to show that African Americans’ cognitive test performance suffers under stereotype threat, i.e., the fear of confirming negative stereotypes concerning one’s group. A meta-analysis of 55 published and unpublished studies of this effect shows clear signs of publication bias. The effect varies widely across studies, and is generally small. Although elite university undergraduates may underperform on cognitive tests due to stereotype threat, this effect does not generalize to non-adapted standardized tests, high-stakes settings, and less academically gifted test-takers. Stereotype threat cannot explain the difference in mean cognitive test performance between African Americans and European Americans.
What publication bias means here is that when a study fails to find a real impact of stereotype threat, the study simply goes unpublished. But why would this scandal be so surprising when it comes to the race debate? Stoet's suspicions were justified.
Another line of attack is offered by Gottfredson (2000, p. 139). She argues that if stereotype threat were a real source of racial differences in IQ scores, then “one should find, among other things, that mental tests generally underestimate Blacks’ later performance in school and work, that test results are sensitive to race and the emotional supportiveness of the tester but not to the mental complexity of the task, and that racial gaps in test scores rise and fall with changes in the racial climate. Accumulated research, however, reveals quite the opposite (e.g., Jensen, 1980, 1998)”. Further evidence is provided by Sackett et al. (2008, pp. 222-223). Now, from Herrnstein & Murray (1994, Appendix 5, pp. 650-654), we can read:
If for example, blacks do better in school than whites after choosing blacks and whites with equal test scores, we could say that the test was biased against blacks in academic prediction. Similarly, if they do better on the job after choosing blacks and whites with equal test scores, the test could be considered biased against blacks for predicting work performance. This way of demonstrating bias is tantamount to showing that the regression of outcomes on scores differs for the two groups. On a test biased against blacks, the regression intercept would be higher for blacks than whites, as illustrated in the graphic below. Test scores under these conditions would underestimate, or “underpredict,” the performance outcome of blacks. A randomly selected black and white with the same IQ (shown by the vertical broken line) would not have equal outcomes; the black would outperform the white (as shown by the horizontal broken lines). The test is therefore biased against blacks. On an unbiased test, the two regression lines would converge because they would have the same intercept (the point at which the regression line crosses the vertical axis).
So if, at identical test scores, blacks outperformed whites at school or at work, the test would be biased against blacks. The regression of outcomes on scores would then differ between the groups. The graphic above shows what would happen if a test were biased against blacks. In that case, the regression intercept would be higher for blacks than for whites. For example, blacks would outperform whites at school when the two groups' IQ level is held constant.
But the graphic above captures only one of the many possible manifestations of predictive bias. Suppose, for example, a test was less valid for blacks than for whites.  In regression terms, this would translate into a smaller coefficient (slope in these graphics), which could, in turn, be associated either with or without a difference in the intercept. The next figure illustrates a few hypothetical possibilities.
All three black lines have the same low coefficient; they vary only in their intercepts. The gray line, representing whites, has a higher coefficient (therefore, the line is steeper). Begin with the lowest of the three black lines. Only at the very lowest predictor scores do blacks score higher than whites on the outcome measure. As the score on the predictor increases, whites with equivalent predictor scores have higher outcome scores. Here, the test bias is against whites, not blacks. For the intermediate black line, we would pick up evidence for test bias against blacks in the low range of test scores and bias against whites in the high range. The top black line, with the highest of the three intercepts, would accord with bias against blacks throughout the range, but diminishing in magnitude the higher the score.
In a scenario where the IQ test were less valid for blacks than for whites, in regression terms we would obtain a smaller coefficient, and the line on the graph would be less steep. This could in turn be associated with a difference in the intercept, i.e., the point where the line crosses the vertical axis, or with no such difference.
Another scenario is shown in the graphic above: the three lines representing blacks have the same coefficient, since their slope is identical, but their intercepts differ, while the gray line representing whites is steeper. Begin with the lowest black line. Only at the very lowest predictor scores (e.g., IQ) do blacks score higher than whites on the outcome measures (e.g., scholastic achievement tests). As scores on the predictor increase, whites with equivalent predictor scores obtain higher outcomes. Here, the test bias runs against whites, not blacks. For the intermediate black line, the test bias runs against blacks in the lower range of test scores and against whites in the upper range. The top black line, with the highest of the three intercepts, would be consistent with bias against blacks throughout the range, though the bias shrinks in magnitude as the predictor score rises.
Now, what does the literature tell us? Generally, there is no bias. IQ test scores even overpredict the outcome measures for blacks; that is, blacks perform worse than whites on these measures of academic achievement when their IQ level is held constant.
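To make the intercept logic concrete, here is a minimal simulation (hypothetical numbers, not real data) built to match the overprediction pattern just described: at a fixed IQ score, the black group's outcome runs about 0.2 SD lower, so separately fitted regressions yield a lower intercept for that group, the opposite of what bias against blacks would require.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data: same slope in both groups, but at a fixed IQ score
# the black group's outcome runs ~0.2 SD lower (overprediction pattern)
iq_w = rng.normal(100, 15, n)
iq_b = rng.normal(85, 15, n)
gpa_w = 0.04 * iq_w + rng.normal(0.0, 0.5, n)
gpa_b = 0.04 * iq_b - 0.2 + rng.normal(0.0, 0.5, n)

# Fit outcome-on-score regressions separately for each group
slope_w, int_w = np.polyfit(iq_w, gpa_w, 1)
slope_b, int_b = np.polyfit(iq_b, gpa_b, 1)

# Bias AGAINST blacks would require int_b > int_w (underprediction);
# overprediction shows up as the lower intercept instead.
print(f"intercepts: white {int_w:.2f}, black {int_b:.2f}")
```

With equal slopes, the intercept difference fully captures the predictive bias; the mixed slope-and-intercept cases Herrnstein & Murray sketch next would show up here as unequal fitted slopes.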
Readers will quickly grasp that test scores can predict outcomes differently for members of different groups and that such differences may justify claims of test bias. So what are the facts? Do we see anything like the first of the two graphics in the data – a clear difference in intercepts, to the disadvantage of blacks taking the test? Or is the picture cloudier – a mixture of intercept and coefficient differences, yielding one sort of bias or another in different ranges of the test scores? When questions about data come up, cloudier and murkier is usually a safe bet. So let us start with the most relevant conclusion, and one about which there is virtual unanimity among students of the subject of predictive bias in testing: No one has found statistically reliable evidence of predictive bias against blacks, of the sort illustrated in the first graphic, in large, representative samples of blacks and whites, where cognitive ability tests are the predictor variable for educational achievement or job performance. In the notes, we list some of the larger aggregations of data and comprehensive analyses substantiating this conclusion.  We have found no modern, empirically based survey of the literature on test bias arguing that tests are predictively biased against blacks, although we have looked for them.
When we turn to the hundreds of smaller studies that have accumulated in the literature, we find examples of varying regression coefficients and intercepts, and predictive validities. This is a fundamental reason for focusing on syntheses of the literature. Smaller or unrepresentative individual studies may occasionally find test bias because of the statistical distortions that plague them. There are, for example, sampling and measurement errors, errors of recording, transcribing, and computing data, restrictions of range in both the predictor and outcome measurements, and predictor or outcome scales that are less valid than they might have been.  Given all the distorting sources of variation, lack of agreement across studies is the rule. […]
Insofar as the many individual studies show a pattern at all, it points to overprediction for blacks. More simply, this body of evidence suggests that IQ tests are biased in favor of blacks, not against them. The single most massive set of data bearing on this issue is the national sample of more than 645,000 school children conducted by sociologist James Coleman and his associates for their landmark examination of the American educational system in the mid-1960s. Coleman’s survey included a standardized test of verbal and nonverbal IQ, using the kinds of items that characterize the classic IQ test and are commonly thought to be culturally biased against blacks: picture vocabulary, sentence completion, analogies, and the like. The Coleman survey also included educational achievement measures of reading level and math level that are thought to be straightforward measures of what the student has learned. If IQ items are culturally biased against blacks, it could be predicted that a black student would do better on the achievement measures than the putative IQ measure would lead one to expect (this is the rationale behind the current popularity of steps to modify the SAT so that it focuses less on aptitude and more on measures of what has been learned). But the opposite occurred. Overall, black IQ scores overpredicted black academic achievement by .26 standard deviations.  …
A second major source of data suggesting that standardized tests overpredict black performance is the SAT. Colleges commonly compare the performance of freshmen, measured by grade point average, against the expectations of their performance as predicted by SAT scores. A literature review of studies that broke down these data by ethnic group revealed that SAT scores overpredicted freshman grades for blacks in fourteen of fifteen studies, by a median of .20 standard deviation.  In five additional studies where the ethnic classification was “minority” rather than specifically “black,” the SAT score overpredicted college performance in all five cases, by a median of .40 standard deviation. 
For job performance, the most thorough analysis is provided by the Hartigan Report, assessing the relationship between the General Aptitude Test Battery (GATB) and job performance measures. Out of seventy-two studies that were assembled for review, the white intercept was higher than the black intercept in sixty of them – that is, the GATB overpredicted black performance in sixty out of the seventy-two studies.  Of the twenty studies in which the intercepts were statistically significantly different (at the .01 level), the white intercept was greater than the black intercept in all twenty cases. 
These findings about overprediction apply to the ordinary outcome measures of academic and job performance. But it should also be noted that “overprediction” can be a misleading concept when it is applied to outcome measures for which the predictor (IQ, in our continuing example) has very low validity. Insofar as blacks and whites differ on average in their scores on some outcome, and that difference is not linked to the predictor, the test will appear biased against whites. Consider the next figure, constructed on the assumption that the predictor is nearly invalid and that the two groups differ on average in their outcome levels.
The concept of overprediction is nevertheless misleading when applied to outcome measures for which the predictor (here, IQ) has very low validity. To the extent that blacks and whites differ on average in their scores on some outcome unrelated to the predictor, the predictor will appear biased against whites. The figure above thus depicts a scenario in which the predictor is nearly invalid and the two groups differ on average in their outcome levels. In that graphic, the intercepts differ, and the fact that the white intercept is higher reflects higher performance by whites relative to blacks that is not explained by the predictor variable. In this scenario, IQ tests appear biased against whites, not blacks. If the missing predictive factors were identified and included in the predictor, the intercept difference would vanish.
This situation is relevant to some of the outcome measures discussed in Chapter 14, such as short-term male unemployment, where the black and white means are quite different, but IQ has little relationship to short-term unemployment for either whites or blacks. This figure was constructed assuming only that there are factors influencing outcomes that are not captured by the predictor, hence its low validity, resulting in the low slope of the parallel regression lines.  The intercepts differ, expressing the generally higher level of performance by whites compared to blacks that is unexplained by the predictor variable. If we knew what the missing predictive factors are, we could include them in the predictor, and the intercept difference would vanish – and so would the implication that the newly constituted predictor is biased against whites. What such results seem to be telling us is, first, that IQ tests are not predictively biased against blacks but, second, that IQ tests alone do not explain the observed black-white differences in outcomes. It therefore often looks as if the IQ test is biased against whites.
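The low-validity intercept effect described in this passage is easy to reproduce numerically. Below is a minimal sketch (pure Python; the sample sizes, the validity of 0.1, and the group shift of 0.5 are all invented for illustration) in which an outcome is regressed on a nearly invalid predictor, and the groups differ on an outcome component the predictor does not capture:

```python
import random

random.seed(1)

def ols(xs, ys):
    """Ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return b, my - b * mx

def simulate(n, validity, group_shift):
    # Outcome depends only weakly on the predictor (low validity);
    # group_shift is outcome variation NOT captured by the predictor.
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [validity * x + group_shift + random.gauss(0, 1) for x in xs]
    return xs, ys

xw, yw = simulate(5000, 0.1, 0.5)   # group with higher unexplained outcome
xb, yb = simulate(5000, 0.1, 0.0)   # reference group

bw, aw = ols(xw, yw)
bb, ab = ols(xb, yb)

# Slopes are similar and low; intercepts differ by roughly the
# unexplained group gap of 0.5.
print(round(bw, 2), round(bb, 2), round(aw - ab, 2))
```

With these made-up numbers, both regression lines are nearly flat and parallel while the intercepts differ by about the unexplained gap, which is exactly the pattern the text describes as apparent bias against the higher-scoring group.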
The absence of underprediction for blacks or other minorities is not in dispute. But the absence of underprediction does not necessarily imply the absence of measurement bias (Wicherts & Millsap, 2009), because it tells us nothing about the probability that members of a particular group obtain a given score. Measurement bias is in fact revealed when two persons with the same latent ability (on a latent factor, e.g., verbal, arithmetic, etc.) have different probabilities of obtaining the same score on the subtest(s). Concretely, this would show up as a difference in measurement intercepts, meaning that for two persons with identical latent ability, the loading of one or more subtests on the latent factor is not proportionally related to the standardized group differences on those subtests (Wicherts & Dolan, 2010, Figure 3). Conversely, an observed score is considered measurement invariant (MI) when it does not depend on the person's ethnicity or group, since differences in latent factor scores are then due solely to differences in factor means. In that case, we would observe invariance of factor loadings and measurement intercepts across the two groups. If models with equality constraints on the measurement intercepts show poor fit, the conclusion would be that the differences in observed scores do not simply reflect differences in latent ability. To this end, Wicherts et al. (2005) used multigroup confirmatory factor analysis (MGCFA) to test whether stereotype threat violates factorial invariance. In short, measurement invariance proved untenable under stereotype threat conditions.
Interestingly, the authors criticized the use of ANCOVA, i.e., analysis of covariance, to control for prior scores (as Steele & Aronson did), on the grounds that “stereotype threat may lower the regression weight of the dependent variable on the covariate in the stereotype threat condition, which violates regression weight homogeneity over all experimental cells” (p. 698). There is no reason to assume, in their view, that stereotype threat effects are homogeneous: “Higher SAT scores would imply higher domain identification and therefore stronger ST effects” (Wicherts, 2005). In that case, the score differences found in stereotype threat experiments would be attributed to features of the test or of the way it was administered, not to differences in general intellectual ability.
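The ANCOVA objection can be made concrete with a toy simulation (pure Python; the 0.6 decrement that grows with the covariate is an invented stand-in for “stronger ST effects at higher domain identification”, not an estimate from any study). When the threat effect interacts with the covariate, the within-cell regression weight of the test score on the prior score shrinks, violating the homogeneity assumption that ANCOVA requires:

```python
import random

random.seed(3)

def ols_slope(xs, ys):
    """Slope of the least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def cell(n, threat):
    # Covariate: prior SAT-like score. The test score tracks it, but
    # under threat the decrement grows with the covariate (a proxy for
    # domain identification), flattening the within-cell regression.
    sat = [random.gauss(0, 1) for _ in range(n)]
    score = [s - (0.6 * max(s, 0.0) if threat else 0.0) + random.gauss(0, 0.3)
             for s in sat]
    return sat, score

b_control = ols_slope(*cell(5000, threat=False))
b_threat = ols_slope(*cell(5000, threat=True))

# The regression weight differs across experimental cells, so a
# covariate adjustment assuming a common slope is not justified.
print(round(b_control, 2), round(b_threat, 2))
```

Under these assumptions the control cell recovers a slope near 1 while the threat cell's slope is markedly lower, which is precisely the heterogeneity Wicherts (2005) warns about.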
Wicherts et al. (2005, p. 698) explain the MI model as follows:
We first look at the formal definition of measurement invariance (Mellenbergh, 1989), which is expressed in terms of the conditional distribution of manifest test scores Y [denoted by f(Y | η, v)]. Measurement invariance with respect to v holds if:
f(Y | η, v) = f(Y | η), (1)
(for all Y, η, v), where η denotes the scores on the latent variable (i.e., latent ability) underlying the manifest random variable Y (i.e., the measured variable), and v is a grouping variable, which defines the nature of groups (e.g., ethnicity, sex). Note that v may also represent groups in experimental cells such as those that differ with respect to the pressures of stereotype threat. Equality 1 holds if, and only if, Y and v are conditionally independent given the scores on the latent construct η (Lubke et al., 2003b; Meredith, 1993).
One important implication of this definition is that the expected value of Y given η and v should equal the expected value of Y given only η. In other words, if measurement invariance holds, then the expected test score of a person with a certain latent ability (i.e., η) is independent of group membership. Thus, if two persons of a different group have exactly the same latent ability, then they must have the same (expected) score on the test. Suppose v denotes sex and Y represents the scores on a test measuring mathematics ability. If measurement invariance holds, then test scores of male and female test takers depend solely on their latent mathematics ability (i.e., η) and not on their sex. Then, one can conclude that measurement bias with respect to sex is absent and that manifest test score differences in Y correctly reflect differences in latent ability between the sexes.
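The definition in Equation 1 can be illustrated with a small simulation (pure Python; the loading of 0.8, the intercept of 10, and the −1.0 threat-like shift are all invented). Conditioning on the same latent ability η, group membership should not change the expected observed score when invariance holds, whereas a group-specific shift in the measurement model shows up immediately:

```python
import random

random.seed(2)

LOADING, INTERCEPT = 0.8, 10.0   # same measurement model in both groups

def observed_score(eta, bias=0.0):
    # Y = lambda*eta + tau + bias + error; a nonzero bias breaks Eq. (1)
    return LOADING * eta + INTERCEPT + bias + random.gauss(0, 0.5)

def mean_score_at_ability(group_mean, bias, target=0.0, n=200000):
    """Average observed score for persons whose latent ability is ~target."""
    total, count = 0.0, 0
    for _ in range(n):
        eta = random.gauss(group_mean, 1)
        if abs(eta - target) < 0.05:        # condition on latent ability
            total += observed_score(eta, bias)
            count += 1
    return total / count

# Groups differ in factor mean (1.0 vs 0.0) but not in measurement:
invariant_gap = mean_score_at_ability(1.0, 0.0) - mean_score_at_ability(0.0, 0.0)

# Same factor mean, but one group's scores carry a threat-like bias:
biased_gap = mean_score_at_ability(1.0, 0.0) - mean_score_at_ability(1.0, -1.0)

# Under invariance, equal latent ability implies equal expected score,
# regardless of the groups' factor means; the biased case does not.
print(round(invariant_gap, 2), round(biased_gap, 2))
```

The first gap is approximately zero even though the groups' mean abilities differ, matching the passage's claim that, under MI, two persons of different groups with the same η must have the same expected score; the second gap recovers the injected bias.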
In the first study analyzed by Wicherts et al. (2005, pp. 703-705), the differential effect of ST manifested itself as non-invariance of factor loadings, which indicates an interaction between the stereotype threat effect and latent ability on the test. In other words, the ST effect depends on the subject's (latent) IQ, owing to individual differences in domain identification.
The measurement bias due to stereotype threat was related to the most difficult NA subtest. An interesting finding is that, because of stereotype threat, the factor loading of this subtest did not deviate significantly from zero. This change in factor loading suggests a non-uniform effect of stereotype threat. This is consistent with the third scenario discussed above (cf. Appendix B) and with the idea that stereotype threat effects are positively associated with latent ability (cf. Cullen et al., 2004). Such a scenario could occur if latent ability and domain identification are positively associated. This differential effect may have led low-ability (i.e., moderately identified) minority students to perform slightly better under stereotype threat (cf. Aronson et al., 1999), perhaps because of moderate arousal levels, whereas the more able (i.e., highly identified) minority students performed worse under stereotype threat. Such a differential effect is displayed graphically in Figure 5.
In their discussion of the first study they examined (on minority students in the Netherlands), Wicherts et al. write: “The intelligence factor explains approximately 0.1% of the variance in the NA subtest, as opposed to 30% in the other groups. To put it differently, because of stereotype threat, the NA test has become completely worthless as a measure of intelligence in the minority group”. In their conclusion (p. 711), the authors regard stereotype threat as a source of measurement bias. Consequently, stereotype threat does not affect the latent abilities themselves.
However, constructs such as intelligence and mathematic ability are stable characteristics, and stereotype threat effects are presumably short-lived effects, depending on factors such as test difficulty (e.g., O’Brien & Crandall, 2003; Spencer et al., 1999). Furthermore, stereotype threat effects are often highly task specific. For instance, Seibt and Förster (2004) found that stereotype threat leads to a more cautious and less risky test-taking style (i.e., prevention focus), the effects of which depend on whether a particular task is speeded or not, or whether a task demands creative or analytical thinking (cf. Quinn & Spencer, 2001). In light of such task specificity, we view stereotype threat effects as test artifacts, resulting in measurement bias.
In addition, Rushton and Jensen (2005, pp. 249-250; 2010, pp. 16-17) reviewed several studies contradicting stereotype threat theory. No Factor X, that is, a cultural or environmental variable (racism, stereotypes, etc.) acting specifically on one ethnic group, has ever been found.
Another way of answering the question is to compare their psychometric factor structures of kinship patterns, background variables, and subtest correlations. If there are minority-specific developmental processes [i.e., stereotype threat, race stigma, white racism, history of slavery, lowered expectations, heightened stress, etc.] arising from cultural background differences between the races at work, they should be reflected in the correlations between the background variables and the outcome measures. Rowe (1994; Rowe, Vazsonyi, & Flannery, 1994, 1995) examined this hypothesis in a series of studies using structural equation models. One study of six data sources compared cross-sectional correlational matrices (about 10 x 10) for a total of 8,528 Whites, 3,392 Blacks, 1,766 Hispanics, and 906 Asians (Rowe et al., 1994). These matrices contained both independent variables (e.g., home environment, peer characteristics) and developmental outcomes (e.g., achievement, delinquency). A LISREL goodness-of-fit test found each ethnic group’s covariance matrix equal to the matrix of the other groups. Not only were the Black and White matrices nearly identical, but they were as alike as the covariance matrices computed from random halves within either group. There were no distortions in the correlations between the background variables and the outcome measures that suggested any minority-specific developmental factor.
Another study examined longitudinal data on academic achievement (Rowe et al., 1995). Again, any minority-specific cultural processes affecting achievement should have produced different covariance structures among ethnic and racial groups. Correlations were computed between academic achievement and family environment measures in 565 full-sibling pairs from the National Longitudinal Survey of Youth, each tested at ages 6.6 and 9.0 years (White N = 296 pairs; Black N = 149 pairs; Hispanic N = 120 pairs). Each racial group was treated separately, yielding three 8 x 8 correlation matrices, which included age as a variable. Because LISREL analysis showed the matrices were equal across the three groups, there was no evidence of any special minority-specific developmental process affecting either base rates in academic achievement or any changes therein over time.
Similarly, Carretta and Ree (1995) examined the more specialized and diverse Air Force Officer Qualifying Test, a multiple-aptitude battery that had been given to 269,968 applicants (212,238 Whites, 32,798 Blacks, 12,647 Hispanics, 9,460 Asian Americans, and 2,551 Native Americans). The g factor accounted for the greatest amount of variance in all groups, and its loadings differed little by ethnicity. Thus, the factor structure of cognitive ability is nearly identical for Blacks and for Whites, as was found in the studies by Owen (1992) and Rushton and Skuy (2000; Rushton et al., 2002, 2003) comparing Africans, East Indians, and Whites on the item structures of tests described in Section 3. There was no “Factor X” specific to race.
The series of studies conducted by Rowe (1994, pp. 408-410; 1995, pp. 35-38) shows, in fact, that the hypothesis of a group-specific causal process responsible for the lower performance of a particular ethnic group is untenable. If such a process existed, the pattern of correlations (1) between environment and achievement, (2) between siblings, and (3) across ages would have to be distinct for that particular group (e.g., Africans). The absence of a Factor X is a decisive rejection of the assumption underlying stereotype threat theory.
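The factor-structure similarity these studies report is conventionally quantified with Tucker's congruence coefficient between the groups' loading vectors, with values above about .95 read as factorial identity. A minimal sketch with invented loadings:

```python
import math

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

# Hypothetical g loadings for four subtests in two groups:
white = [0.80, 0.70, 0.60, 0.75]
black = [0.78, 0.72, 0.58, 0.74]

# Near 1.0: nearly identical factor structure across the two groups.
print(round(congruence(white, black), 3))
```

The coefficient is insensitive to a uniform rescaling of one group's loadings, which is why it is preferred to a plain correlation for comparing factor structures.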
That the familiar black-white IQ difference appears to have a common cause receives further support from Dolan (2000), Dolan & Hamaker (2001), and Lubke et al. (2003). They show that black-white IQ differences do not arise from measurement bias, which implies that differences ‘within’ an ethnic group and differences ‘between’ ethnic groups in cognitive ability derive from a common factor. Here is Lubke's discussion:
4. MI implies that between-group differences cannot be due to other factors than those accounting for within-group differences
The statement that between-group differences are attributable to the same sources as within-group differences (or a subset thereof) is another way of saying that mean differences between groups cannot be due to other factors than the individual differences within each group. To confirm this statement, we have to show that two propositions are tenable by the usual statistical criteria: (1) that the same factors are measured in the model for the means as in the model for the covariances and (2) that the same factors are measured across groups.
The first part follows directly from the way the multigroup model has been derived. We have shown that the two parts of the multigroup model, the model for the means and the model for the covariances, have been deduced from the same regression equation (Eq. (1)). Eq. (1) specifies the relation between observed scores and underlying factors. To derive the multigroup model, we have taken the mean of Eq. (1) (as shown in Eq. (5)) and the variances and covariances (see Eq. (6)). Taking means and (co)variances does not change the relation between observed scores and their underlying factors as specified in Eq. (1). The factors in the model for the means are the same as in the model for the covariances because both submodels are derived from the same regression equation of observed variables on the factors.
The second part is implied by the concept of MI. The concept of MI has been developed by Meredith (1993) to provide the necessary and sufficient conditions to determine whether a set of observed items actually measures the same underlying factor(s) in several groups. MI states that the only difference between groups concerns the factor means and the factor covariances but not the relation of observed scores to their underlying factors. Only if the relation of an observed variable to an underlying factor differs across groups, one can argue that a ‘‘different factor’’ is measured in those groups. If Eq. (1) holds across groups with identical parameter values, with the understanding that the mean and the covariances of the factors, η in Eq. (1), may differ, then one can conclude that the proposition that same factors are measured across groups is tenable.
To illustrate our argument, we discuss two scenarios that show why differences in the sources of within- and between-group differences are inconsistent with MI. First, we discuss the case that all factors underlying between-group differences are different from the factors underlying within-group differences. Second, we consider a situation in which the within-group factors are a subset of the between-group factors, that is, the two types of factors coincide but there are additional between-group factors that do not play a role in explaining the within-group differences. In addition, we show that the case, where between-factors are a subset of the within-factors, is consistent with MI and that the modeling approach provides the means to test which of within-group factors does not contribute to the between-group differences.
Suppose observed mean differences between groups are due to entirely different factors than those that account for the individual differences within a group. The notion of ‘‘different factors’’ as opposed to ‘‘same factors’’ implies that the relation of observed variables and underlying factors is different in the model for the means as compared with the model for the covariances, that is, the pattern of factor loadings is different for the two parts of the model. If the loadings were the same, the factors would have the same interpretation. In terms of the multigroup model, different loadings imply that the matrix Λ in Eq. (9) differs from the matrix Λ in Eq. (10) (or Eqs. (5) and (6)). However, this is not the case in the MI model. Mean differences are modeled with the same loadings as the covariances. Hence, this model is inconsistent with a situation in which between-group differences are due to entirely different factors than within-group differences. In practice, the MI model would not be expected to fit because the observed mean differences cannot be reproduced by the product of α and the matrix of loadings, which are used to model the observed covariances. Consider a variation of the widely cited thought experiment provided by Lewontin (1974), in which between-group differences are in fact due to entirely different factors than individual differences within a group. The experiment is set up as follows. Seeds that vary with respect to the genetic make-up responsible for plant growth are randomly divided into two parts. Hence, there are no mean differences with respect to the genetic quality between the two parts, but there are individual differences within each part. One part is then sown in soil of high quality, whereas the other seeds are grown under poor conditions. Differences in growth are measured with variables such as height, weight, etc. 
Differences between groups in these variables are due to soil quality, while within-group differences are due to differences in genes. If an MI model were fitted to data from such an experiment, it would be very likely rejected for the following reason. Consider between-group differences first. The outcome variables (e.g., height and weight of the plants, etc.) are related in a specific way to the soil quality, which causes the mean differences between the two parts. Say that soil quality is especially important for the height of the plant. In the model, this would correspond to a high factor loading. Now consider the within-group differences. The relation of the same outcome variables to an underlying genetic factor are very likely to be different. For instance, the genetic variation within each of the two parts may be especially pronounced with respect to weight-related genes, causing weight to be the observed variable that is most strongly related to the underlying factor. The point is that a soil quality factor would have different factor loadings than a genetic factor, which means that Eqs. (9) and (10) cannot hold simultaneously. The MI model would be rejected.
In the second scenario, the within-factors are a subset of the between-factors. For instance, a verbal test is taken in two groups from neighborhoods that differ with respect to SES. Suppose further that the observed mean differences are partially due to differences in SES. Within groups, SES does not play a role since each of the groups is homogeneous with respect to SES. Hence, in the model for the covariances, we have only a single factor, which is interpreted in terms of verbal ability. To explain the between-group differences, we would need two factors, verbal ability and SES. This is inconsistent with the MI model because, again, in that model the matrix of factor loadings has to be the same for the mean and the covariance model. This excludes a situation in which loadings are zero in the covariance model and nonzero in the mean model.
As a last example, consider the opposite case where the between-factors are a subset of the within-factors. For instance, an IQ test measuring three factors is administered in two groups and the groups differ only with respect to two of the factors. As mentioned above, this case is consistent with the MI model. The covariances within each group result in a three-factor model. As a consequence of fitting a three-factor model, the vector with factor means, α in Eq. (9), contains three elements. However, only the two elements corresponding to the factors with mean group differences are nonzero. The remaining element is zero. In practice, the hypothesis that an element of α is zero can be investigated by inspecting the associated standard error or by a likelihood ratio test (see below).
In summary, the MI model is a suitable tool to investigate whether within- and between-group differences are due to the same factors. The model is likely to be rejected if the two types of differences are due to entirely different factors or if there are additional factors affecting between-group differences. Testing the hypothesis that only some of the within factors explain all between differences is straightforward. Tenability of the MI model provides evidence that measurement bias is absent and that, consequently, within- and between-group differences are due to factors with the same conceptual interpretation.
Consider, then, Lewontin's plant analogy, in which two batches of seeds grow in different environments, one in rich soil, the other in a desert, with the growth differences within groups due to genetic differences and the differences between groups due to soil quality. In that case, measurement invariance will be rejected.
Between groups, the outcome variables (such as plant height and weight) are related to soil quality, which drives the mean difference between the two batches. In the measurement invariance model, this corresponds to a high factor loading. Within groups, however, the relation between the same outcome variables and an underlying genetic factor will be different. The genetic variation within each batch may be especially pronounced for weight-related genes, making weight the observed variable most strongly related to the underlying factor. The soil-quality factor would thus have different loadings than the genetic factor, and measurement invariance would be violated.
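The proportionality constraint at the heart of this argument can be stated in two lines: under a single-factor MI model, the vector of between-group mean differences must be a scalar multiple of the factor loadings that also generate the within-group covariances (Δμ = ΛΔα). In the Lewontin scenario the “soil” loadings and the “genetic” loadings differ, so the constraint fails. A sketch with invented loading vectors:

```python
# Two outcome variables: (height, weight). Invented loading vectors:
genetic_loadings = (1.0, 3.0)   # within-group factor: weight-dominated
soil_loadings = (3.0, 1.0)      # between-group cause: height-dominated

# Mean gap on (height, weight) produced by a soil-quality difference of 2.0:
delta_mu = tuple(2.0 * l for l in soil_loadings)

def proportional(u, v, tol=1e-9):
    """True if v is a scalar multiple of u (2-D cross-product test)."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

# The mean gap lines up with the soil loadings, not the genetic ones, so
# a model forcing one common loading vector cannot fit both the means
# and the covariances at once:
print(proportional(genetic_loadings, delta_mu),
      proportional(soil_loadings, delta_mu))  # → False True
```

This is why, in Lubke's terms, Eqs. (9) and (10) cannot hold simultaneously in this scenario: no single Λ reproduces both the within-group covariance structure and the between-group mean difference.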
Since the familiar black-white IQ gap does not violate MI, while ST does violate it, we are led to conclude that IQ differences between populations derive from common factors, that is, have the same causes within and between groups. Ad hoc theories are not needed. To quote Gottfredson (2004, p. 37): “According to social privilege theory, there would be no racial inequality in a fair, non-discriminatory society. The continuing existence of racial inequality is therefore proof of continuing discrimination”, with race itself dismissed as a social construct. Ideology is not needed.
- Dolan Conor V., 2000, Investigating Spearman’s hypothesis by means of multi-group confirmatory factor analysis.
- Dolan Conor V., and Hamaker Ellen L., 2001, Investigating black–white differences in psychometric IQ: Multi-group confirmatory factor analysis of the WISC-R and K-ABC and a critique of the method of correlated vectors.
- Duckworth Angela Lee, Quinn Patrick D., Lynam Donald R., Loeber Rolf, and Stouthamer-Loeber Magda, 2011, Role of test motivation in intelligence testing.
- Gottfredson Linda S., 2000, Skills Gaps, Not Tests, Make Racial Proportionality Impossible.
- Gottfredson Linda S., 2004, Social Consequences of Group Differences in Cognitive Ability.
- Herrnstein Richard J., and Murray Charles, 1994, The Bell Curve: Intelligence and Class Structure in American Life, With a New Afterword by Charles Murray.
- Jensen Arthur R., 1998, The g Factor: The Science of Mental Ability.
- Jussim Lee, Cain Thomas R., Crawford Jarret T., Harber Kent, Cohen Florette, 2009, The Unbearable Accuracy of Stereotypes.
- Lubke Gitta H., Dolan Conor V., Kelderman Henk, Mellenbergh Gideon J., 2003, On the relationship between sources of within- and between-group differences and measurement invariance in the common factor model.
- Nguyen Hannah-Hanh D., and Ryan Ann Marie, 2008, Does Stereotype Threat Affect Test Performance of Minorities and Women? A Meta-Analysis of Experimental Evidence.
- Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1994, No More Than Skin Deep: Ethnic and Racial Similarity in Developmental Process.
- Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1995, Ethnic and Racial Similarity in Developmental Process: A Study of Academic Achievement.
- Rushton J. Philippe, and Jensen Arthur R., 2005, Thirty Years of Research on Race Differences in Cognitive Ability.
- Rushton J. Philippe, and Jensen Arthur R., 2010, Race and IQ: A Theory-Based Review of the Research in Richard Nisbett’s Intelligence and How to Get It.
- Sackett Paul R., Hardison Chaitra M., and Cullen Michael J., 2004, On Interpreting Stereotype Threat as Accounting for African American-White Differences on Cognitive Tests.
- Sackett Paul R., Borneman Matthew J., and Connelly Brian S., 2008, High-Stakes Testing in Higher Education and Employment.
- Steele Claude M., and Aronson Joshua, 1995, Stereotype Threat and the Intellectual Test Performance of African Americans.
- Steele Claude M., and Aronson Joshua, 1998, Stereotype Threat and the Test Performance of Academically Successful African Americans, pp. 401-426, in The Black-White Test Score Gap, by Jencks Christopher and Phillips Meredith (1998).
- Stoet Gijsbert, and Geary David C., 2012, Can Stereotype Threat Explain the Gender Gap in Mathematics Performance and Achievement?
- Wicherts Jelte M., 2005, Stereotype Threat Research and the Assumptions Underlying Analysis of Covariance.
- Wicherts Jelte M., & Dolan Conor V., 2010, Measurement Invariance in Confirmatory Factor Analysis: An Illustration Using IQ Test Performance of Minorities.
- Wicherts Jelte M., Dolan Conor V., and Hessen David J., 2005, Stereotype Threat and Group Differences in Test Performance: A Question of Measurement Invariance.
- Wicherts Jelte M., and Millsap Roger E., 2009, The Absence of Underprediction Does Not Imply the Absence of Measurement Bias.
- Wicherts Jelte M., and de Haan Cor, 2009, unpublished, Stereotype threat and the cognitive test performance of African Americans.