Charles Murray on the Consistency of the Black-White Difference in IQ

Charles Murray (2006, 2008) disputes Dickens and Flynn’s (2006) study, “Black Americans reduce the racial IQ gap: Evidence from standardization samples,” which reported a decrease in Black-White IQ differences. In Murray’s analyses, the Black-White gap shows no decrease among cohorts born since the 1970s. In the second study (2008), he further suggests that the gap around World War I may have been understated: with a gap of 1.52σ instead of 1.16σ, the average Black American IQ of that period could be 78, not 85. Building on Murray’s conclusion, Rushton and Jensen (2010) hypothesized that the actual B-W gap could be 1.52σ (with an average Black American IQ near 78), to the extent that studies excluded the poorest African Americans, who lived in the rural South.

Changes over time in the black–white difference on mental tests: Evidence from the children of the 1979 cohort of the National Longitudinal Survey of Youth

Data for three Peabody achievement tests and for the Peabody picture vocabulary test administered to children of women in the 1979 cohort of the National Longitudinal Survey of Youth show that the black–white difference did not diminish for this sample of children born from the mid 1970s through the mid 1990s. This finding persists after entering covariates for the child’s age and family background variables. It is robust across alternative samples and specifications of the model. The analysis supplements other evidence that shows no narrowing of the black–white difference in academic achievement tests since the late 1980s and is inconsistent with recent evidence that narrowing occurred in IQ standardizations during the same period. A hypothesis for reconciling this inconsistency is proposed.

1. Introduction

[…] A meta-analysis of the B–W difference on cognitive tests (Roth, Bevier, Bobko, Switzer, & Tyler, 2001) did not analyze trends, but their conclusion from test results extending into the 1990s was that the B–W difference for highly g loaded test batteries centers on 1.1 standard deviations, in line with characterizations of the historic B–W difference (Herrnstein & Murray, 1994; Rushton & Jensen 2005; Dickens & Flynn, in press-a).

2.3. Analytic procedure

The analysis consists of a random effects generalized least squares (GLS) analysis in which the group variable (i.e., the unit comprised of repeated measures) is the child and the panel variable is test year. Robust standard errors were obtained, adjusted for repeated measures of the same child. For the analysis of trends over time for each race, each specification of the model was implemented separately for each race, with the child’s birth date as the variable of interest. For the analysis of changes in the B–W difference over time, the independent variables were entered along with a categorical variable for race (black=1) and interaction terms by race for each of the other independent variables, with the interaction term between race and birth date being the variable representing the net change in the B–W difference. A positive coefficient indicates that black scores rose with birth date after taking all the other covariates into account, thereby signifying a narrowing B–W difference; a negative coefficient indicates that black scores fell with birth date, signifying a widening B–W difference.
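The sign convention for the interaction term described above can be illustrated with a minimal sketch. It is not the paper's method: it uses ordinary least squares via NumPy rather than random effects GLS with robust standard errors, it omits all covariates, and the data are synthetic and noise-free. Every variable name and coefficient below is a hypothetical chosen for the example.

```python
import numpy as np

# Synthetic, noise-free illustration only: none of these numbers come from the NLSY.
birth_year = np.tile(np.arange(0, 20, dtype=float), 2)  # years since an arbitrary origin
group = np.repeat([0.0, 1.0], 20)                       # dummy variable for group membership

# Hypothetical data-generating coefficients, chosen for the example.
intercept, b_year, b_group, b_interaction = 100.0, 0.10, -15.0, -0.05
score = (intercept + b_year * birth_year + b_group * group
         + b_interaction * group * birth_year)

# Design matrix: constant, birth year, group dummy, group x birth-year interaction.
X = np.column_stack([np.ones_like(birth_year), birth_year, group, group * birth_year])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

# The last coefficient is the interaction term: positive means the group difference
# narrows with birth date, negative means it widens.
interaction = coef[3]
trend = "narrowing" if interaction > 0 else "widening" if interaction < 0 else "no change"
print(f"interaction coefficient: {interaction:.3f} ({trend})")
```

Because the synthetic scores contain no noise, the fit recovers the hypothetical interaction coefficient; with real data the estimate would carry a standard error, which is why the paper reports significance separately from sign.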

3. Results

3.1. Trends by birth date by race

Fig. 1 below shows the trends for each of the four tests by birth date, unadjusted for any covariates. Each child in the sample is represented by a single score (the median score for children tested more than once). The trend lines are produced by a bivariate regression of test score on child’s birth date. The squares and circles show means by birth year when the sample size for that year was 50 or more. Unadjusted test scores rose over time for both blacks and whites for all of the tests. As an inspection of the regression lines indicates, the B–W difference remained unchanged for the two reading tests and increased slightly for mathematics and the PPVT-R.

[…] Table 2 shows what happens to the unadjusted results when the covariates are entered in the random effects GLS analysis, focusing on the variable of central interest, child’s birth date. The full results for the regression are shown in Appendix Table A-1.

Once the covariates have been taken into account, the rise in scores shown in Fig. 1 disappears. Instead, the overall story is one of flat or falling scores for both white and black children over time. The trends over time varied by race, as follows:

Whites: After adjusting for the covariates, the overall trend in white test scores over time was markedly down for reading comprehension, reaching statistical significance in three of the four models. The trend was effectively flat for math and slightly but insignificantly downward for reading recognition and the PPVT-R.

Blacks: Black scores fell on all four tests after adjusting for the covariates, in all four specifications of the model. The drop in reading comprehension reached statistical significance in all four models. The drop in the PPVT-R was of similar magnitude in all four models (.31–.40 points per birth year) and reached statistical significance in both of the models using dummy variables for mother’s age at birth. The reductions in reading recognition and math were smaller and did not reach statistical significance in any of the specifications of the model.

The role of other independent variables: This analysis was not designed to explore the role of the other independent variables in detail, but these generalizations apply:

The role of the child’s age at time of testing, represented by age-appropriate grade in Table 2, was strikingly different for white and black children. For whites, scores in math and the PPVT-R rose significantly with age, while falling in reading comprehension. For blacks, scores fell with age on all the tests except the PPVT-R. These results were substantial for the full sample but ambiguous in the sample matched on mother’s age at birth.


It is this cluster of effects – the children born later in time were, on average, born to smarter, better-educated, more affluent women, as well as to women who were older and presumptively more mature – that accounts for the rising trend lines by birth date shown in Fig. 1 and the reversal of those slopes in the multivariate analyses.

3.2. Trends in the B–W difference over time

Full results for the analysis using a dummy variable for race and interaction terms are shown in Table 3 on the next page. Table 4 below summarizes the results for the interaction term between race and birth date, the variable that estimates the change in the B–W difference. Table 4 also recasts these coefficients in terms of the implied change in the B–W difference per decade, expressed in standard deviations.
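As a rough sketch of how a coefficient expressed in test-score points per birth year can be recast as a change per decade in standard deviations, the conversion is a multiplication by ten and a division by the test's standard deviation. The standard deviation of 15, the sign convention (positive output means a widening difference), and the example coefficient are assumptions for illustration; they are not taken from Table 4.

```python
def gap_change_per_decade_sd(interaction_coef: float, test_sd: float = 15.0) -> float:
    """Convert an interaction coefficient (points per birth year) into the implied
    change in the group difference per decade, in standard deviation units.

    Assumed convention: a negative coefficient means a widening difference,
    so the sign is flipped to report widening as a positive number.
    """
    return -interaction_coef * 10.0 / test_sd


# Hypothetical coefficient, not a value reported in the paper.
example = gap_change_per_decade_sd(-0.24)
print(f"implied change: {example:.2f} SD per decade")
```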

The changes are small and statistically consistent with an interpretation of “no change” in the B–W difference, but the sign of the coefficient for the interaction term is consistent within tests. All four specifications of the reading recognition test indicated a small convergence in the B–W difference (even the coefficient that rounded to .00 in Model 1 was positive at the third decimal place). All four specifications of the other three tests indicated a small increase in the B–W difference. The largest increase, still not reaching statistical significance, was for the PPVT-R, where the implied increases in the B–W difference ranged from .13 to .19 S.D.s per decade in the four versions of the analysis.


The results in Table 4 supplement previous evidence from the NAEP and SAT that no closure in the B–W difference occurred among children born from the early 1970s into the last half of the 1990s on achievement tests. The results are inconsistent with the narrowing of the B–W difference in the IQ test standardizations during the same period found by Dickens and Flynn (in press-a). This is most explicitly true for the PPVT-R, a measure of verbal IQ, but the failure of the PIAT-RC and PIAT-M to converge is also relevant, insofar as both tests measure reasoning ability along with learned knowledge.

Potential technical explanations for this inconsistency were explored. One possibility was that the selection of NLSY children for testing created an artifact. The NLSY tried to test all children who were eligible (because of their age or prior testing history) for a given test on a given test year, but, after the initial 1986 test wave, it was decided to test only children who were living full-time or part-time with their mothers. This criterion meant that proportionately more black than white children were ineligible for testing (15.2% of testing opportunities for black children compared to 10.2% for white children). Not living with the mother is associated with factors that might tend to depress test scores (abandonment, foster care, institutionalization). The potentially lowest-scoring black children were in this respect probably underrepresented in the NLSY sample. But this artifact would tend to understate the real B–W test differences for the test waves following 1986, because the omitted children, disproportionately disadvantaged, were also disproportionately black. No other inconsistencies in the selection of black and white children for testing were identified.

A second possibility was that the sample for this analysis was systematically skewed by including cases that were part of the NLSY’s oversampling of blacks and low-income whites. To explore this possibility, Models 1 and 2 were replicated with observations restricted to the 13,602 observations of children of NLSY women who were part of the nationally representative cross-sectional sample. The results are shown in Appendix Table A-2, but they may be summarized quickly: None of the differences in the results from the two samples approached statistical significance. The results leave open the possibility that profiles of the B–W difference vary by socioeconomic class, but that topic requires a full-scale analysis of its own.

A third possibility was that the use of birth date as the independent variable of interest affected the results. Age at testing had a significant interaction effect with race, which could overlap with the effect of the interaction between race and birth date, raising potential multicollinearity problems. Accordingly, all of the analyses were replicated substituting test year for birth date in the GLS regressions. The results for Models 1 and 2 are also shown in Table A-2. Without exception, the interaction between race and test year produces coefficients that are within a few hundredths of the coefficients for the interaction between race and birth date.

A fourth possibility was that all of the models control for too much. If the socioeconomic position of black NLSY mothers improved relative to white mothers during the observation period, controlling for the family background variables could mask environmentally-caused improvements in the performance of the children. Replications of the interaction analyses using reduced models that omitted mother’s AFQT score, education at birth, marital status at birth, and family income were conducted to test this possibility. The results from Models 1 and 2 are shown in Table A-2. All of the negative coefficients are larger when the family background variables are omitted, and the sign of the coefficient for reading recognition changes from positive to negative.

Finally, it may be asked whether the dummy variables used for mother’s age at birth, consisting of 4-year segments, were sufficiently narrow to preclude multicollinearity problems within the segments. Three versions of Model 2 were explored, dividing the mother’s age at birth into 2-year and 1-year segments as well as 4-year segments. Those results are also summarized in Table A-2. The version with the 4-year segments presented in the text produced the smallest estimates of increases in the B–W difference.

The magnitude and components of change in the black–white IQ difference from 1920 to 1991: A birth cohort analysis of the Woodcock–Johnson standardizations

The black–white difference in test scores for the three standardizations of the Woodcock–Johnson battery of cognitive tests is analyzed in terms of birth cohorts covering the years from 1920 through 1991. Among persons tested at ages 6–65, a narrowing of the difference occurred in overall IQ and in the two most highly g-loaded clusters in the Woodcock–Johnson, Gc and Gf. After controlling for standardization and interaction effects, the magnitude of these reductions is on the order of half a standard deviation from the high point among those born in the 1920s to the low point among those born in the last half of the 1960s and early 1970s. These reductions do not appear for IQ or Gc if the results are restricted to persons born from the mid-1940s onward. The results consistently point to a B–W difference that has increased slightly on all three measures for persons born after the 1960s. The evidence for a high B–W IQ difference among those born in the early part of the 20th century and a subsequent reduction is at odds with other evidence that the B–W IQ difference has remained unchanged. The end to the narrowing of the B–W IQ difference for persons born after the 1960s is consistent with almost all other data that have been analyzed by birth cohort.

2.3. Approach to the effects of test age

The first test-age issue involves the young. The decision not to include subjects tested before age six was taken, first, because children younger than five were given a reduced battery of tests, and tracking changes in the full battery of tests is an objective of this analysis, and, second, because the B–W difference on mental tests among the very young is systematically different from the difference among older subjects. In infancy, the B–W difference can be close to zero (Fryer & Levitt, 2004). The difference rises through the preschool years, usually reaching about 0.70σ on full-scale IQ batteries by 5 to 6 years of age, then rising within a few years to about 1.0σ where it stabilizes for the rest of elementary school (Jensen, 1998). The results from the Woodcock–Johnson standardizations follow the common pattern for IQ tests. The B–W differences for children tested at age five were just 0.30σ, 0.12σ, and 0.69σ for WJ1, WJ2, and WJ3 respectively, with a combined difference of 0.57σ (white n=555, black n=83) rising immediately thereafter to 1.00σ, 0.81σ, and 1.13σ respectively at age 6, with a combined difference of 0.87σ (white n=733, black n=135). [1]

3. Results

3.1. Period analysis

Table 2 shows the kind of period analysis of the black–white difference that has been most common in the literature, comparing test results according to the years when the test was administered. In this instance, Table 2 shows the means and B–W differences on overall IQ and the seven cognitive clusters for each of the three Woodcock–Johnson standardizations using the unadjusted scores.

A substantial reduction in the B–W difference in IQ occurred from WJ1 to WJ2, from 1.23σ to 0.90σ, a drop of one third of a standard deviation. The difference increased again to 1.05σ in WJ3, but a net reduction of 0.18σ from WJ1 to WJ3 remains, equivalent to about 3 IQ points. Turning to the six clusters of cognitive functioning for which measures were available in all three standardizations, a substantial reduction in the B–W difference from WJ1 to WJ3 is observed for Ga (0.47σ), with small reductions for Gsm, Glr, and Gs. The two most highly g-loaded clusters, Gc and Gf, showed small increases. In summary, the period analysis of the black–white difference in the Woodcock–Johnson shows reductions in the B–W difference in overall intelligence and most of its components. These improvements were apparently concentrated in the 1980s.
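The arithmetic behind the figures in this paragraph can be checked with a short sketch. The three differences are the ones quoted above; the standard deviation of 15 points is the conventional IQ scaling and is an assumption here, since the excerpt does not state it.

```python
IQ_SD = 15.0  # assumed conventional IQ scale (mean 100, SD 15)

# B-W differences in overall IQ, in standard deviations, as quoted in the text.
gap_sd = {"WJ1": 1.23, "WJ2": 0.90, "WJ3": 1.05}

drop_wj1_to_wj2 = gap_sd["WJ1"] - gap_sd["WJ2"]  # about one third of an SD
net_wj1_to_wj3 = gap_sd["WJ1"] - gap_sd["WJ3"]   # net reduction across all three
net_in_points = net_wj1_to_wj3 * IQ_SD           # the text rounds this to about 3 points

print(f"WJ1 to WJ2: {drop_wj1_to_wj2:.2f} SD")
print(f"WJ1 to WJ3: {net_wj1_to_wj3:.2f} SD = {net_in_points:.1f} IQ points")
```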

3.2. Birth cohort analysis of the B–W difference among persons born from 1920 to 1991

Fig. 1 recasts the numbers shown in Table 2 as a birth cohort analysis of overall IQ, using means by birth year. […] For the birth years prior to 1956, black sample sizes were not large enough to produce interpretable means over short periods of time. The two dots show the B–W difference in standard deviations for persons born from 1920 to 1939 (white n=429, black n=74) and 1940–55 (white n=693, black n=94).

The B–W difference among persons born from 1920 to 1939 was 1.33σ. The difference dropped to 1.08σ for those born from 1940 to 1955. When the line begins in 1958, the difference was extremely large, reaching a high of 1.45σ in 1959. The difference dropped steeply throughout the 1960s, reaching its low in 1972, at 0.83σ. For those born most recently, 1987–1991, the difference was 0.98σ.

The corresponding plots are shown for the separate cognitive clusters in Fig. 2. The trends in the cognitive clusters have been widely divergent. The clusters with the highest g-loadings are Gc, comprehension-knowledge, and Gf, fluid reasoning. Both of them show a drop in the B–W difference, reaching lows in 1966 of 0.94σ for Gc and 0.57σ for Gf. For those born most recently, 1987–1991, the difference had risen to 1.19σ and 0.71σ respectively.

Table 3 shows the regression results, with the interactions between race and birth date in boldface. Table 4 summarizes these results in terms of the B–W difference produced by fitted values and compares them with the corresponding scores used to produce Figs. 1 and 2.

If the Woodcock–Johnson had tested only people under the age of 40 from mid-century (Model 2), the same regression equation provides evidence that the B–W difference in IQ not only failed to narrow, but widened from a fitted value of 0.75σ for those born in 1947 to 1.05σ for those born in 1981. The fitted values show a smaller increase for Gc, from 0.86σ to 0.99σ, and a decrease in the B–W difference for Gf, from 0.90σ to 0.74σ.

The choice of interpretation depends on the weight one places on the scores of persons tested at older ages in the 1920s and 1930s, when the B–W difference was extremely large. For the period 1920–39 represented in Fig. 1, the B–W difference for the combined samples was 1.33σ (n=74). But the B–W difference was even larger for those born in the 1920s, standing at 1.59σ (black n=31).

4. Discussion

[…] If the evidence is restricted to persons tested under the age of 40 (Model 2), the multivariate analysis provides no support for a narrowing B–W difference in IQ and Gc for persons born from the late 1940s onward, even if the low points of the B–W difference (early 1970s for IQ and the mid-1960s for Gc) are used as the end point for the comparison. A case can be made for a substantial reduction in the B–W difference in Gf if the mid-1960s is used as the end point for the comparison, but not if the end point is extended to the 1980s or early 1990s.


The conclusion that the B–W difference narrowed is countered by the earliest measures of racial differences in IQ, which consist of a large number of studies catalogued in Shuey (1966) showing an average B–W difference of no more than 1σ (Loehlin, Lindzey, & Spuhler, 1975; Gottfredson, 2005) and the Army Alpha and Army Beta tests used during World War I, representing men born around 1900, which showed a B–W difference of 1.16σ (Loehlin, Lindzey, & Spuhler, 1975, based on Yerkes, 1921). If those results are taken at face value, they overwhelm the evidence for a higher B–W difference during that era obtained from the Woodcock–Johnson standardizations.

They cannot be taken at face value, however. At the time of World War I, almost 70% of all blacks still lived in the rural South (Myrdal, 1944), unschooled or very poorly schooled. This population, presumptively with the lowest mean black IQ, is effectively unrepresented in the Shuey studies, and there is reason to believe that it was radically underrepresented among those draftees who reached the point of being administered the Army Alpha and Army Beta tests (Keith, 2004).

The caution with which one must approach the World War I data is accentuated by the data from World War II. The B–W difference on the Army General Classification Test for inductions in 1944–1945 has been put at 1.52σ (Loehlin, Lindzey, & Spuhler, 1975). This represents the scores of men born from 1925 to 1927, and is very close to the 1.59σ difference observed among the Woodcock–Johnson subjects born in the 1920s. How could the B–W difference in IQ have risen from 1.16σ to 1.52σ in 20 years? The simplest explanation is that the World War II testing produced a more accurate nationally representative estimate of the B–W difference than did the World War I testing.


The NAEP, the WAIS standardizations, and the Stanford–Binet adult scale standardizations all show a narrowing of the B–W difference prior to the 1970s and an end to narrowing sometime during the 1970s. Specifically:

The NAEP. The B–W difference in the NAEP was first measured for the cohort born in 1954. On reading, math, and science, the difference at baseline was at least 1.03σ and as high as 1.29σ. Subsequent rounds of the NAEP showed a narrowing B–W difference for cohorts born throughout the 1960s. The narrowest point in the B–W difference through the 2004 administration of the NAEP occurred among cohorts born from 1969 to 1979, varying by test from 0.38σ to 1.04σ (Perie & Moran, 2005). [4]

The WAIS standardizations. The B–W difference in the WAIS was at its widest among cohorts born in the 1920s, at about 1.14σ. It reached its narrowest level, 0.73σ, among cohorts born from 1971 to 1975.

Stanford–Binet adult scale standardizations. The narrowing in the standardizations of the Stanford–Binet adult scale, from 1.11σ to 0.98σ, occurred in the period between cohorts born from 1962 to 1973 and cohorts born from 1978 to 1989.

One source of data, the AFQT standardizations conducted in 1980 and 1997, shows a narrowing of the B–W difference from 1.21σ to 0.97σ among persons born in the 1960s and/or 1970s (author’s analysis of NLSY-79 and NLSY-97). The narrowing in the B–W difference occurred sometime after 1964, the last birth year for NLSY-79, and before 1980, the earliest birth year for NLSY-97. There is no way to know what years from 1965 to 1979 account for the bulk of the reduction.

Two sources show an unchanging B–W difference for persons born since the mid-1970s with no information on when the stable difference began:

Children of the NLSY-79 Women. The B–W difference on the Peabody reading recognition, reading comprehension, mathematics, and picture vocabulary tests increased slightly for the children of women in the NLSY-79 cohort born from the mid-1970s through the mid-1990s (Murray, 2006).

The Stanford–Binet children’s scale standardizations. The B–W difference was effectively flat between cohorts born from 1974 to 1978 and cohorts born from 1990 to 1994 (0.65σ and 0.62σ respectively).

Only two data sources that I have been able to identify are seemingly inconsistent with the Woodcock–Johnson results:

The GSS vocabulary test. GSS data are now available through the 2004 survey, 6 years longer than the observation period available to Huang and Hauser (2001), and they show a continuing decline in the B–W difference for persons born into the early 1980s (author’s analysis of the GSS). […] The decline in the B–W difference in the GSS vocabulary test for persons born since mid-century is entirely attributable to a decline in white performance, not improvement in black performance.

The WISC standardizations. The only known clear contradiction between the Woodcock–Johnson results and another longitudinal data set involves the WISC standardizations. For children tested at ages 6–16, the B–W difference on the WISC narrowed from 1.08σ among cohorts born from 1973 to 1983 to 0.78σ among cohorts born from 1986 to 1996. With that exception, all the available data on changes in the B–W difference across birth cohorts is consistent with the proposition that narrowing in the difference ended no later than the close of the 1970s, with the bulk of the evidence pointing to the first half of the 1970s.

The results presented here generally support arguments (e.g., Rushton & Jensen 2005) that the B–W difference has narrowed least on the most highly g-loaded and most highly heritable cognitive clusters […]

It is worth noting that the actual black-white IQ gap is probably larger than 15 IQ points. Studies based on school samples omit high school dropouts, and dropout rates are higher among blacks than among whites. Studies also omit prisoners: incarcerated offenders have lower IQs than the public at large (The Bell Curve, p. 242), and incarceration rates are higher among blacks than among whites. Rushton assumed that “educational researchers seldom get to examine the very lowest scoring segments of the Black population in inner cities” and argued that the actual black-white IQ gap could be underestimated, given that “An IQ of 71 was found for the Black children in an entire school district from a rural county in Georgia in the U. S. Deep South; the White IQ in the same county was 101 [30]”.

Even if the black-white gap narrows in the near future, this would not imply that the gap in achievement and IQ can be completely eliminated.

