JACK A. NAGLIERI
Ohio State University
ARTHUR R. JENSEN
University of California, Berkeley
INTELLIGENCE 11, 21-43 (1987)
The Kaufman Assessment Battery for Children (K-ABC) and the Wechsler Intelligence Scale for Children-Revised (WISC-R) are compared with respect to the magnitudes of the average white-black differences in standardized scaled scores and in raw scores. The two test batteries were administered to a sample of 172 fourth- and fifth-grade children comprising 86 black-white pairs matched on age, sex, school, and socioeconomic status. The K-ABC and WISC-R are highly correlated, and the general factor, or g, of one battery is virtually identical to the g of the other. The high positive correlation between the size of the white-black difference on the various subtests of both batteries and the subtests’ loadings on the g factor bears out Spearman’s hypothesis that a test’s white-black discriminability is a direct function of the test’s g loading. The lesser white-black discriminability of the K-ABC relative to the WISC-R is attributable to (1) the smaller g loadings of the K-ABC subtests and (2) the presence of other factors, particularly sequential short-term memory, which, to some degree, offsets the white-black difference in g.
The Kaufman Assessment Battery for Children (K-ABC) (Kaufman & Kaufman, 1983) owes some of its initial prominence and popularity to its claim that the magnitude of the average black-white difference on the K-ABC IQ is smaller (by about one-half) than the average black-white IQ difference generally reported for most other standardized tests of intelligence, such as the Stanford-Binet and the Wechsler Intelligence Scale for Children-Revised (WISC-R) (Wechsler, 1974), which is the most widely used individual IQ test in today’s schools.
As Jensen (1984) and Naglieri (1986) have previously pointed out, the claim of a lesser black-white difference on the K-ABC was based on evidence that is methodologically far from optimal for establishing this claim. In comparisons of the K-ABC with other tests, such as the WISC-R, both tests were not administered to the same black and white samples. Moreover, comparisons were based on the scaled scores derived from the different standardization samples of the K-ABC and the WISC-R, which renders ambiguous the locus of the tests’ difference in black-white discriminability, that is, whether the difference in discriminability is intrinsic to the nature of the tests themselves or is merely an artifact attributable to differences in the range of ability represented in the tests’ standardization samples.
By testing the same samples of black and white children on both the K-ABC and the WISC-R, Naglieri (1986) found that, although the white sample obtained higher scaled scores than the black sample on all 11 of the WISC-R subtests and on all 13 of the K-ABC subtests, the sizes of the black-white differences generally were slightly smaller on the K-ABC than on the WISC-R, but not smaller to the degree claimed by the authors of the K-ABC (Kaufman & Kaufman, 1983). This smaller effect, however, is probably due in some part to the fact that Naglieri’s black and white samples were closely matched on socioeconomic status. Because the comparisons were made using the standardized scaled scores of each test, the question of the degree of intrinsic difference between the K-ABC and WISC-R with respect to black-white discriminability essentially remained unanswered. Black-white differences in standardized scores on the two tests reflect to some unknown degree possible differences in the tests’ standardization samples.
The purpose of the present study is to throw light on this question by basing comparisons on both standardized scores and raw scores. In addition, the study seeks to elucidate the basis for the K-ABC’s lesser black-white difference in terms of the hypothesis originally attributable to Charles Spearman (1927), who suggested that tests differ in their black-white discriminability as a positive function of their loadings on the general factor, or g, defined as the largest common factor in a collection of diverse cognitive tests. Accordingly, Jensen (1984) has argued that the lesser black-white discriminability of the K-ABC, relative to the WISC-R, is due to the smaller g saturation of the K-ABC, particularly the eight subtests of the K-ABC Mental Processing scale on which the IQ is based. The present study tests both Spearman’s hypothesis and Jensen’s application of it to understanding the nature of the difference between the K-ABC and the WISC-R with respect to their black-white discriminability.
The sample comprised a total of 172 elementary school pupils in the fourth and fifth grades of three elementary schools in the central district of Columbus, OH. These three schools were selected because they enrolled approximately equal numbers of black and white children, which made it feasible to match the black and white samples on relevant variables. A matched-pairs procedure was used. Each black child (N = 86) was matched with a white child on age (±3 months), school, sex, and socioeconomic status (SES), resulting in 86 black-white matched pairs (40 males, 46 females). A 5-point scale of SES similar to the one presented in the WISC-R Manual (Wechsler, 1974, p. 18) was adopted. It is based mainly on the parent’s occupational level (the higher of the two levels if both parents were employed). Children whose parents were unemployed or had very low incomes were classed as SES Level 4 if the child was enrolled in the reduced-cost lunch program; those enrolled in the free lunch program were classed as SES Level 5. The mean SES for whites was 3.4 (SD = 1.2) and for blacks, 3.4 (SD = 1.2). The mean age for whites was 10.7 years (SD = 8.2 months; range = 9.3-12.4 years), and for blacks, 10.8 years (SD = 8.0 months; range = 9.4-12.4 years). The distribution of blacks and whites in the five levels of SES is as follows:
The WISC-R and K-ABC were individually administered in two sessions a week apart by two white examiners, one male, one female. Each examiner tested the same matched black and white children on both the WISC-R and K-ABC. The two tests were given in a completely counterbalanced order (by race), with the matched pairs as the unit of assignment to the two order conditions; the members of any given pair were always given the two tests in the same order.
The WISC-R subtests are familiar enough to the readers of this journal not to need description here. The subtests of the newer and less well-known K-ABC, however, warrant brief descriptions:
Mental Processing Scale
Hand movements (HM) Imitating a series of three hand movements in the same sequence as the examiner performed them.
Number recall (NR) Repeating a series of digits in the same sequence as the examiner said them.
Word order (WO) Touching a series of pictures in the same sequence as they were named by the examiner, with more difficult items employing a color interference task.
Gestalt closure (GC) Naming the object or scene pictured in a partially completed “inkblot” drawing.
Triangles (T) Assembling several identical triangles into an abstract pattern that matches a model.
Matrix analogies (MA) Selecting the picture or abstract design that best completes a visual analogy.
Spatial memory (SM) Recalling the placement of pictures on a page that was exposed briefly.
Photo series (PS) Placing photographs of an event in chronological order.
Achievement Scale
Faces & places (FP) Naming the well-known person, fictional character, or place pictured in a photograph or illustration.
Arithmetic (AR) Answering a question that requires knowledge of math concepts or the manipulation of numbers.
Riddles (R) Naming the object or concept described by a list of its characteristics.
Reading/decoding (RD) Naming letters and reading words.
Reading/understanding (RU) Acting out commands given in written sentences.
RESULTS AND DISCUSSION
Comparison of Black-White Differences on WISC-R and K-ABC
Table 1 shows the mean raw scores and the standardized scores (based on the national standardization sample) for the black and white samples. For both the raw scores and standardized scores, the mean difference between the black and white samples is expressed in units of the average within-groups standard deviation, labeled σ Diff. (see Footnote a in Table 1).
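As a rough illustration of how such a standardized difference can be computed, the sketch below divides the mean difference by an average within-group standard deviation. The exact averaging formula is given in Footnote a of Table 1, which is not reproduced here; the root-mean-square of the two group SDs used below is an assumption, the function name `sigma_diff` is hypothetical, and the scores are invented.

```python
import math
from statistics import mean, stdev

def sigma_diff(white_scores, black_scores):
    """Mean white-black difference in units of the average within-group SD.

    Assumption: the "average" SD is the square root of the mean of the two
    group variances; the article's Footnote a may define it differently.
    """
    sd_w = stdev(white_scores)
    sd_b = stdev(black_scores)
    avg_sd = math.sqrt((sd_w ** 2 + sd_b ** 2) / 2)
    return (mean(white_scores) - mean(black_scores)) / avg_sd

# Hypothetical raw scores on one subtest, for illustration only.
white = [12, 15, 11, 14, 13, 16]
black = [11, 13, 10, 14, 12, 15]

d = sigma_diff(white, black)
print(round(d, 2))
```

Because the difference is expressed in SD units, values computed this way can be compared across subtests whose raw-score scales differ.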
Several points in Table 1 should be noted: The raw score black-white differences (σ Diff.) are very highly correlated with the standardized score differences, .984 for the WISC-R and .935 for the K-ABC. (The corresponding rank-order correlations are .975 and .960.) The lower correlation for the K-ABC is probably not a statistically significant deviation and seems too slight to warrant any interpretation.
How similar is the pattern of the mean black-white differences on the 11 WISC-R subtests (standardized scores) in the present study to the corresponding pattern of black-white differences found in the national standardization sample of the WISC-R? The rank-order correlation between the two patterns (i.e., the two analogous sets of mean white-black differences on the 11 WISC-R subtests) is +.72. The same question was asked concerning the mean white-black differences on the 13 subtests of the K-ABC. The pattern of the 13 white-black differences (in standard scores) found in the present study has a rank-order correlation of +.81 with the corresponding pattern of white-black differences in the national standardization sample of the K-ABC. Hence, the patterns of white-black differences across the subtests of the WISC-R and the K-ABC in the present samples are not atypical of the corresponding patterns found in the national standardization samples. This fact enhances the generalizability of the present findings, although the present samples are matched on schools and SES, whereas the standardization samples are not.
Although the pattern of differences is much the same for both the raw scores and the standardized scores, the white-black differences differ in absolute level for the two types of scores. The absolute difference between the raw score and standardized score differences is greater for the K-ABC than for the WISC-R. The mean (across subtests) white-black differences (in σ units) for raw and standardized scores are as follows:
This result contradicts Jensen’s (1984) conjecture that a part of the lesser white-black difference on the K-ABC than on the WISC-R might be attributable to differences in the variance of ability in the different standardization populations of the WISC-R and K-ABC. In fact, in the present samples, the white-black difference on K-ABC standardized scores is larger than on the raw scores, and the K-ABC differs more from the WISC-R on the raw score white-black differences than on the standardized score differences. The standardization of the K-ABC scores, therefore, appears not to be one of the mechanisms responsible for the generally smaller white-black difference on the K-ABC relative to the white-black difference on the WISC-R.
The same effects are seen in the WISC-R and K-ABC scale composites shown at the bottom of Table 1. The K-ABC Mental Processing composite has a white-black difference only 57% (i.e., 100 × .42σ/.73σ) as large as the Full Scale WISC-R for raw scores and 75% (i.e., 100 × .58σ/.77σ) as large for standardized scores. In terms of the tests’ actual IQ scales, the white-black difference on the Full Scale WISC-R is 9.1 IQ points; on the K-ABC Mental Processing Scale it is 6.0 IQ points, or two-thirds the size of the difference on the WISC-R. The K-ABC Achievement Scale shows a white-black difference equivalent to 6.8 IQ points, although the K-ABC does not use the achievement subtests in computing the IQ, which is based entirely on the mental processing subtests. But, the main point to be noted here is that the K-ABC yields smaller white-black differences, expressed in σ units, than the Full Scale WISC-R, for all of the K-ABC scales, whether compared in terms of raw scores or standardized scores. However, the raw-score comparisons are the more essential for our purpose, because they are wholly unaffected by any possible differences in the standardization samples of the WISC-R and the K-ABC.
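The ratios quoted in the preceding paragraph can be retraced from the rounded composite values given there. The sketch below uses only those rounded figures, so its results may differ slightly from percentages computed on unrounded data; the helper name `percent_of` is hypothetical.

```python
def percent_of(kabc_diff, wisc_diff):
    """K-ABC difference expressed as a percentage of the WISC-R difference."""
    return 100 * kabc_diff / wisc_diff

# Composite white-black differences in sigma units, as quoted in the text.
raw_pct = percent_of(0.42, 0.73)   # raw scores
std_pct = percent_of(0.58, 0.77)   # standardized scores

# Differences on the tests' IQ scales, as quoted in the text.
iq_ratio = 6.0 / 9.1               # K-ABC Mental Processing vs. WISC-R Full Scale

print(f"raw: {raw_pct:.1f}%  standardized: {std_pct:.1f}%  IQ ratio: {iq_ratio:.2f}")
```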
Correlations between WISC-R and K-ABC
The Pearson correlation coefficients between all of the WISC-R and K-ABC subtests are shown separately for the black and the white samples in Table 2. To determine whether the correlation matrices of the two samples differ significantly, the Jennrich (1970) chi-square test for the equality of two correlation matrices was applied. The resulting χ², with 276 df, is 1817.95, indicating a difference that is significant beyond the .001 level. This means, too, that the factor structure of the combined WISC-R and K-ABC batteries could be significantly different between the black and white samples, although the few largest and significant factors possibly might not differ appreciably.
The general factor common to all of the subtests of both batteries, however, is highly similar — in fact, virtually identical — in both samples. As it is the general factor that constitutes most of the variance in the composite scale scores of each battery, it is not surprising that the WISC-R and K-ABC scales show high and similar correlations in both samples, as shown in Table 3.
The high correlations between the K-ABC Mental Processing and Achievement scales with the WISC-R Full Scale IQ presented in Table 3 demonstrate that these scales are both more highly correlated with the WISC-R Full Scale than they are with each other (K-ABC Mental Processing × Achievement correlations are .50, .58, and .58 for the black, white, and total samples, respectively). This finding is further supported by Naglieri (1985a), who found that the WISC-R Full Scale IQ is correlated significantly higher with the K-ABC Achievement scale than with the K-ABC Mental Processing scale. Additionally, Naglieri (1984), Naglieri and Anderson (1985), and Naglieri (1985a, 1985b) found the K-ABC Achievement and the WISC-R Verbal to be the most highly correlated of the separate scales of the two tests. Because these scales essentially measure verbal knowledge acquired through educational and other means, this result is expected.
The strong correlations between the WISC-R Performance and K-ABC Simultaneous scales evident in Table 3 reflect the nonverbal spatial nature of the tasks on each of the scales. This strong Simultaneous/Performance relationship also influences the correlations between the Mental Processing and Performance scales because the K-ABC Mental Processing scale is simultaneously weighted. That is, the Mental Processing scale comprises five Simultaneous and three Sequential subtests, one of which — Hand Movements — evidenced strong simultaneous loadings (Kaufman & Kamphaus, 1985), that contribute equally to the scale. These results indicate that the WISC-R and K-ABC scales are consistent in the measurement of verbal/achievement and nonverbal/spatial intellectual skills.
The above correlations may be compared with the correlations between other IQ tests. In a quite comprehensive review of the correlations between many different intelligence tests, it was found that the overall average correlation is +.67 (Jensen, 1980, pp. 314-316). Most such correlations fall in the range +.50 to +.80. In 47 studies reporting a correlation between the WISC IQ and the Stanford-Binet IQ, the correlations range from +.43 to +.94, with a median correlation of +.80. Hence, the K-ABC is not atypical in its correlations with the WISC-R. If the K-ABC based its IQ scale on all 13 of the subtests, rather than on just the 8 Mental Processing subtests, it would correlate as highly with the WISC-R Full Scale IQ as does the Stanford-Binet IQ. The K-ABC Mental Processing and Achievement scores together predict WISC-R Full Scale IQ with the following multiple correlations: black = .87, white = .84, and combined = .87. Corrected for attenuation, assuming a reliability coefficient of .90 for each of the three scales, these Rs would become .91, .95, and .95, respectively. Hence, the WISC-R and K-ABC scales obviously have a great deal of their true-score variance in common, undoubtedly because the composite score of each battery measures predominantly the same general factor.
The variance of group differences, as of individual differences, in the total scores of a test battery reflects the sum of the separate subtest variances plus twice the sum of all the covariances among the subtests. Consequently, a test battery that has generally lower intercorrelations among its subtests will be somewhat less reliably discriminating between individuals and between groups than a battery with higher subtest intercorrelations. The average intercorrelations among subtests of the WISC-R and of the K-ABC scales in the black and white samples are:
A principal factor analysis was performed on the intercorrelations among all 24 of the WISC-R and K-ABC subtests, separately in the black and white groups (Table 2) and in the combined groups. So that the white-black differences would not inflate the correlations for the combined groups, the point-biserial correlations of the dichotomized race variable with each of the 24 subtests were used to partial out race from all of the subtest intercorrelations. (When the two combined groups are of equal size, as is the case here, the correlation, r_xy, between any two variables, X and Y, with the group difference, Z, partialled out, i.e., r_xy.z, is equal to the mean of the separate zero-order correlations, r_xy, in each of the groups.) It is essential for the subsequent test of Spearman’s hypothesis concerning the factorial composition of the white-black difference on psychometric tests that the factor analysis of the tests not be the least contaminated by the group difference itself.
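The partialling step described above can be sketched with the standard first-order partial correlation formula. The data and function names below are hypothetical and are not taken from Table 2. In this toy data the two groups also have equal within-group variances, which is the condition under which the partial correlation and the mean of the within-group correlations agree exactly.

```python
import math

def pearson(x, y):
    """Zero-order Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation of x and y with z partialled out (first-order partial r)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical scores on two subtests for two groups of equal size.
x_a, y_a = [1, 2, 3, 4], [1, 2, 3, 4]   # group A, coded z = 0
x_b, y_b = [3, 4, 5, 6], [2, 4, 3, 5]   # group B, coded z = 1
z = [0] * 4 + [1] * 4

r_partial = partial_r(x_a + x_b, y_a + y_b, z)
r_mean = (pearson(x_a, y_a) + pearson(x_b, y_b)) / 2
print(round(r_partial, 3), round(r_mean, 3))
```

The zero-order correlation in the pooled data is lower than either value here, because the group mean difference shifts the two clusters relative to each other; partialling out the group code removes that shift.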
Three criteria were appealed to for determining the number of factors: the Kaiser-Guttman rule of extracting only factors with eigenvalues greater than 1, Cattell’s scree test, and a maximum likelihood factor analysis, using the LISREL program (Jöreskog & Sörbom, 1981), to test the goodness of fit of the three-factor model (see Gorsuch, 1983, Chapter 8). According to each of the three criteria, for all three correlation matrices (white, black, and combined), the correct number of factors is three. Two factors are clearly inadequate and, when four factors are extracted, the fourth factor, after orthogonal rotation, consists of merely a doublet, with quite small loadings on only the two arithmetic tests: one in the WISC-R and one in the K-ABC. Also, the maximum likelihood test yields a better goodness-of-fit index for the three-factor than for the four-factor solution.
The first three principal factors, then, were used to obtain a Schmid-Leiman (1957) orthogonalized hierarchical factor analysis in the separate and in the combined groups. The results are shown in Table 4.
The three primary factors, Verbal (V), Spatial (S) and Memory Span (MS), are especially clear after residualization by the Schmid-Leiman procedure. Moreover, the subtests with salient loadings (in italics) on each factor are highly similar in the white, black, and combined groups, indicating essentially the same factor structure in both racial groups. The second-order or general factor, g, on which our subsequent analyses mainly focus, shows a coefficient of congruence of +.95 between the black and white groups, indicating that this g factor is essentially the same in both groups. Also, a second-order general factor was extracted from the combined data (with the group difference partialled out) by means of maximum likelihood factor analysis. It shows a congruence coefficient of +.99 with the Schmid-Leiman g factor extracted from the same combined data. As can be seen in the bottom line of Table 4, in every analysis the second-order g factor accounts for more than twice as much of the common factor variance as the three first-order factors combined.
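For readers unfamiliar with the Schmid-Leiman orthogonalization, the sketch below shows the textbook form of the transformation for a two-level hierarchy: each subtest's g loading is the sum of its primary-factor loadings weighted by those factors' loadings on g, and each primary loading is shrunk by the part of the factor not explained by g. This is not the SAS routine used in the present study, and all loadings below are invented; the function name `schmid_leiman` is hypothetical.

```python
import math

def schmid_leiman(first_order, second_order):
    """Orthogonalize a two-level factor hierarchy.

    first_order  : rows are subtests, columns are oblique primary-factor
                   pattern loadings.
    second_order : loading of each primary factor on the general factor g.
    Returns (g_loadings, residualized_primary_loadings).
    """
    g_loadings = [
        sum(l * h for l, h in zip(row, second_order)) for row in first_order
    ]
    residualized = [
        [l * math.sqrt(1 - h ** 2) for l, h in zip(row, second_order)]
        for row in first_order
    ]
    return g_loadings, residualized

# Hypothetical loadings of four subtests on three primary factors (V, S, MS).
first_order = [
    [0.8, 0.0, 0.0],
    [0.7, 0.1, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.0, 0.5],
]
second_order = [0.8, 0.6, 0.5]   # hypothetical loadings of V, S, MS on g

g, resid = schmid_leiman(first_order, second_order)
for g_i, row in zip(g, resid):
    print(round(g_i, 2), [round(v, 2) for v in row])
```

For a subtest that loads on a single primary factor, the squared g loading and the squared residualized loading sum to the original squared loading, so the subtest's communality is preserved while the factors become uncorrelated.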
 The orthogonalized hierarchical factor analysis was performed by means of an SAS computer program based on the routine developed by Wherry (1959), which makes use of multiple group factor analysis and obviates the need for oblique rotation of the primary factors (derived in the present case by principal factor analysis). The simple structure criterion is approximated by removing the overlapping variance between the clusters identified by the principal factor analysis and expressing it as a higher order factor, in the present case, the second-order g factor. The final result — an orthogonalized hierarchical factor structure — is exactly the same factor matrix that is yielded by the method described by Schmid and Leiman (1957).
Although correlations between the oblique primary factors are not required for obtaining the final orthogonalized hierarchical solution by Wherry’s method, Wherry provides the routine for calculating them as an optional by-product of his method. In the present study, the correlations between the Verbal (V), Spatial (S), and Memory Span (MS) oblique primary factors in the white, black, and combined samples are:
The coefficient of congruence, rc, is an index of factor similarity on a scale of 0 to ±1. Unlike the Pearson r, which, being based on standardized variates, reflects only the degree of similarity between the profiles (of factor loadings) per se, the congruence coefficient also reflects differences in the absolute values of the factor loadings. A value of rc above +.90 is the usual criterion for concluding identity of factors, although some experts set a more stringent criterion at +.95. The congruence coefficient is computed as follows:
rc = Σab / SQRT (Σa² Σb²)
where a and b are the homologous factor loadings obtained on a given factor in groups A and B.
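The formula translates directly into code. The sketch below uses invented loadings for two groups; the function name `congruence` is hypothetical.

```python
import math

def congruence(a, b):
    """Coefficient of congruence between two sets of homologous loadings."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

# Hypothetical loadings of three subtests on the same factor in groups A and B.
loadings_a = [0.5, 0.6, 0.7]
loadings_b = [0.6, 0.7, 0.8]   # same profile, uniformly higher by .10

rc = congruence(loadings_a, loadings_b)
print(round(rc, 4))
```

Because the two profiles differ only by a constant, their Pearson r would be exactly 1, whereas the congruence coefficient falls slightly below 1, reflecting the difference in the absolute size of the loadings.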
Concurrent Validity of the g Factor
Because the WISC-R and K-ABC subtests have not heretofore been factor analyzed together, we cannot compare the g of the present analysis with that of any other analysis. Yet, one might wonder if the present g bears much resemblance to the g of the WISC-R and K-ABC found in other samples. It turns out that factor analyzing the two batteries together creates very little distortion of the pattern of g factor loadings in either battery as compared to the g loadings obtained when the two batteries are each factor analyzed separately. In the present combined samples, the g factor loadings obtained from factor analyzing the WISC-R and the K-ABC separately were compared with the corresponding g loadings obtained from factor analyzing the two test batteries together. The coefficients of congruence between the two sets of g loadings thus obtained are +.998 for both the WISC-R and the K-ABC, indicating virtual identity of the g factor whether the WISC-R and K-ABC batteries are factor analyzed separately or combined.
Even when factor analysis is performed on the two batteries separately, and in entirely different samples, the resulting g factors are highly congruent with the g of the combined batteries in the present samples. It is striking evidence of the ubiquity and robustness of g. For example, the g loadings of the WISC-R subtests in the WISC-R national standardization sample (Jensen & Reynolds, 1982), obtained from a Schmid-Leiman hierarchical factor analysis of just the WISC-R subtests, show a congruence coefficient of +.99 with the present g factor loadings of the WISC-R subtests obtained from the present Schmid-Leiman hierarchical factor analysis of the combined WISC-R and K-ABC batteries. The g loadings of the K-ABC subtests in the K-ABC national standardization sample, obtained from a hierarchical factor analysis of just the K-ABC for 10-year-olds (Keith & Dunbar, 1984, Table 2, p. 373), show a congruence coefficient of +.99 with the present g factor loadings of the K-ABC obtained from the combined WISC-R and K-ABC batteries.
Hence, the g factor of the WISC-R and K-ABC in the present samples is virtually identical to that of the national standardization samples. The fact that combining the two batteries does not distort the pattern of factor loadings within each battery is further evidence that the general factor in both batteries represents essentially one and the same g, despite the differences in appearances between the WISC-R and K-ABC batteries. To determine the similarity of the g factor obtained in separate factor analyses of the WISC-R and K-ABC batteries in the present sample (total N = 172), g factor scores for the WISC-R and for the K-ABC were computed on every subject. The Pearson correlation between the g factor scores of the WISC-R and the g factor scores of the K-ABC is +.88. As the reliability of each test is probably not higher than .90, the correlation of +.88 corrected for attenuation (i.e., +.88/.90) yields a true correlation of about +.98. Hence, the WISC-R and K-ABC have virtually the same g factor.
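The correction for attenuation used above is the classical formula in which the observed correlation is divided by the square root of the product of the two reliabilities. With both reliabilities set to .90, the divisor reduces to .90, as in the text. The function name `disattenuate` is hypothetical.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Values quoted in the text: observed r = .88, assumed reliabilities of .90.
corrected = disattenuate(0.88, 0.90, 0.90)
print(round(corrected, 2))   # 0.98
```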
The white-black difference is far from being a constant amount on various kinds of tests. This is clearly evident by inspection of the two σ Diff. columns in Table 1, which show the white-black differences vary widely on different subtests. In the 24 subtests of the combined WISC-R and K-ABC batteries, the white-black differences in the present samples range from near zero (-.02σ) to +0.70σ. Spearman (1927, p. 379) hypothesized that the observed variation in the size of the black-white difference across different tests is primarily a function of the tests’ loadings on the g factor, the more g-loaded tests showing the larger differences. Jensen (1985a) has made a detailed examination of 11 large data sets suitable for testing Spearman’s hypothesis and found the hypothesis consistently borne out. The overall correlation between g loadings and black-white differences (in standard score units) for 121 tests in 11 studies was +.59. Humphreys (1985a), however, has claimed only very weak support (a correlation of only +.19 between the g loadings and black-white differences on 74 tests in the Project TALENT battery) for Spearman’s hypothesis in a comparison of a representative sample of black high school students with a specially restricted sample of white students of low SES (the lowest 15 to 20% in SES of the total representative sample of white students). The specific features of Humphreys’s data from Project TALENT that would account for its unusually weak conformity to Spearman’s hypothesis have been pointed out elsewhere (Jensen, 1985b). Humphreys (1985b) later calculated the correlation of tests’ g loadings with black-white differences for the total representative samples of blacks and whites, without any restriction on SES, and it turned out to be +.59, exactly the same overall correlation found in the 11 studies analyzed by Jensen (1985a).
The question remains open, however, as to how strongly Spearman’s hypothesis is borne out, not when one racial group is intentionally restricted in SES and the other is not (which would be an inappropriate test of the hypothesis), but when both groups are matched on SES throughout the full range of SES. The present data, which were originally obtained for a study by Naglieri (1986), permit examination of Spearman’s hypothesis for black and white groups matched on SES. They also permit a test of the hypothesis suggested by Jensen (1984) that the K-ABC, particularly the Mental Processing scale, yields a smaller black-white difference than the WISC-R (and many other standardized IQ tests) because the K-ABC subtests are generally less g loaded than those of the WISC-R.
A primary test of Spearman’s hypothesis is the correlation between the various tests’ g factor loadings and the mean black-white difference (in σ units) on the tests. All subsequent analyses are based on the raw scores, so that the black-white differences on the WISC-R and K-ABC will not reflect possible differences in their standardization samples.
The Pearson correlation between g loadings and σ differences is +.78. Because there is no suitable significance test for the Pearson r in this case, we also report the rank-order correlation, which is +.75, p < .01, one-tailed test. The scatter diagram for this correlation is shown in Figure 1, indicating the bivariate data points for the various WISC-R and K-ABC subtests.
The correlations between the white-black σ differences and the g loadings obtained in the separate groups are +.71 for the black and +.75 for the white. Over all of the 24 subtests, Spearman’s hypothesis appears to be strongly substantiated.
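The two correlations reported above can be computed as sketched below. The g loadings and σ differences are invented for five hypothetical subtests and are not the values plotted in Figure 1; the rank-order coefficient is obtained here as the Pearson r of the ranks, with tied values sharing their mean rank.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def rank_order_r(x, y):
    """Spearman rank-order correlation."""
    return pearson(ranks(x), ranks(y))

# Hypothetical g loadings and white-black sigma differences for five subtests.
g_loadings = [0.30, 0.45, 0.55, 0.62, 0.70]
sigma_diffs = [0.05, 0.20, 0.15, 0.40, 0.55]

r = pearson(g_loadings, sigma_diffs)
rho = rank_order_r(g_loadings, sigma_diffs)
print(round(r, 2), round(rho, 2))
```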
It is possible, however, that the correlation between g loadings and σ differences could reflect merely the differential reliability coefficients of the various subtests. Both g loadings and σ differences are attenuated by measurement error, and the degree of attenuation would be correlated between the two variables. To determine if this has substantially affected our test of Spearman’s hypothesis, we have controlled for differences in subtest reliabilities in two ways. The most reliable subtest reliabilities available are those computed on the national standardization samples, obtained from the WISC-R and K-ABC test manuals. These were used in the following analyses. First, we partialled the subtest reliabilities out of the correlation between subtest g loadings and σ differences. The partial r is +.85, which is larger than the zero-order r of +.78. Second, we corrected both the g loadings and σ differences for attenuation (by dividing each by the square root of the subtest’s reliability coefficient). The correlation between the disattenuated g loadings and the disattenuated σ differences is +.80. Hence, variation in subtest reliabilities is not in the least responsible for the large positive correlation between subtests’ g loadings and the mean black-white differences, as predicted by Spearman’s hypothesis. Because the disattenuated data yield scarcely different results from those of the raw data, we will henceforth use only the raw data, unless noted otherwise.
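The two reliability controls just described can be sketched as follows. All numbers are invented for four hypothetical subtests (they are not the manual reliabilities or the Table 1 values), and the helper names are assumptions made for the illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_from_r(rxy, rxz, ryz):
    """First-order partial correlation computed from zero-order correlations."""
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical g loadings, sigma differences, and reliabilities.
g = [0.40, 0.55, 0.60, 0.72]
d = [0.10, 0.30, 0.25, 0.50]
rel = [0.64, 0.81, 0.81, 0.90]

# First control: partial the reliabilities out of the g-by-difference correlation.
r_partial = partial_from_r(pearson(g, d), pearson(g, rel), pearson(d, rel))

# Second control: divide each value by the square root of its subtest's
# reliability, then correlate the disattenuated values.
g_true = [v / math.sqrt(r) for v, r in zip(g, rel)]
d_true = [v / math.sqrt(r) for v, r in zip(d, rel)]
r_disattenuated = pearson(g_true, d_true)

print(round(r_partial, 2), round(r_disattenuated, 2))
```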
Another way of looking at Spearman’s hypothesis, originally suggested by Gordon (1985), is based on the point-biserial correlation between each subtest score and the black-white dichotomy (quantitized as black = 0, white = 1). These point-biserial correlations for the 24 subtests are identical to the subtests’ loadings on a black-white factor that would be obtained by including the black-white dichotomy as a variable in a factor analysis of the 24 subtests and rotating the factor axes in such a way as to obtain a maximal loading of 1.00 on the black-white variable. The various subtests’ loadings on this “black-white” factor would be the same as the point biserial correlation of each subtest with the dichotomous black-white variable. This black-white factor, then, can be compared with the g factor (as represented by the first principal component) of the 24 subtests by the usual index of factor similarity, the congruence coefficient. This turns out to be +.965, which indicates that the g factor and the black-white factor (as reflected in these 24 subtests) are so similar as to be practically the same factor. The reason for our using the first principal component as the g factor in this context is now explained. Following Gorsuch (1983, p. 285), Gordon (1985) has suggested that one interpretation of the congruence coefficient is that it is equivalent to the value of the Pearson correlation that would obtain between the sets of factor scores derived from each of the two factors that are being compared for similarity. This interpretation of the congruence coefficient appears not to be exactly true. At least, it does not hold for the present data, for which all the required calculations were double-checked. 
Factor scores on all 172 subjects were calculated from the black-white factor loadings (i.e., the point-biserial correlations between the black-white dichotomy, quantitized as 0 and 1, respectively) on the 24 subtests, and factor scores on the 172 subjects were also calculated from the loadings of the 24 subtests on their first principal component. As previously stated, the congruence coefficient between these two factors is +.965. But, the Pearson correlation between the two sets of factor scores is even higher: +.996. Thus, although Gorsuch’s interpretation of the congruence coefficient as an indicator of the correlation between factor scores does not seem to hold, Gordon’s argument that the black-white factor is essentially the same as the g factor, as represented by the first principal component, is strongly supported by the correlation of +.996 between the factor scores based on these two factors. (The congruence coefficient between the first principal component and the Schmid-Leiman second-order g factor is +.988.)
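The point-biserial correlation that serves as each subtest's loading on the black-white factor is simply the Pearson r between the 0/1 group code and the subtest score. The sketch below, with invented scores for six hypothetical subjects, computes it both from the usual point-biserial formula and as an ordinary Pearson r to show that the two agree.

```python
import math
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(group, scores):
    """Point-biserial r between a 0/1 group code and a continuous score."""
    s1 = [s for g, s in zip(group, scores) if g == 1]
    s0 = [s for g, s in zip(group, scores) if g == 0]
    p = len(s1) / len(scores)              # proportion coded 1
    # Uses the SD of all scores with divisor N (population form).
    return (mean(s1) - mean(s0)) / pstdev(scores) * math.sqrt(p * (1 - p))

# Hypothetical group codes (black = 0, white = 1) and subtest scores.
group = [0, 0, 0, 1, 1, 1]
scores = [8, 10, 9, 11, 12, 10]

r_pb = point_biserial(group, scores)
r_xy = pearson(group, scores)
print(round(r_pb, 3), round(r_xy, 3))
```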
The WISC-R and K-ABC can be compared with respect to Spearman’s hypothesis. A most conspicuous feature of Figure 1 is that five of the eight Mental Processing tests on which the K-ABC IQ is based cluster in the lower left quadrant, that is, they have relatively low g loadings and relatively small black-white differences. Only the Coding and Digit Span tests of the WISC-R fall into this quadrant. All of the Mental Processing subtests in this quadrant except Gestalt Closure and Photo Series, as well as the two WISC-R subtests, are mainly tests of short-term sequential memory, or what has been termed Level I ability, on which white-black differences have generally been found to be minimal or even reversed when g is partialled out (Vernon, 1981). All of the Level I tests, of course, have some g loading, which accounts for their conformity to Spearman’s hypothesis. (The correlation between the g loadings and the mean white-black differences for just these five Level I type tests in the lower left quadrant of Figure 1 is +.90.) Hence, the comparatively low g loadings of five of the eight Mental Processing subtests are implicated in the smaller black-white IQ difference on the K-ABC, as compared with the WISC-R IQ. Three of the Mental Processing subtests do not fall into this cluster, and they are all tests that involve some spatial ability factor in addition to g, particularly Spatial Memory and Triangles, both of which fall above the regression line, that is, the black-white difference is larger than would be predicted from these tests’ g loadings. Evidently, another factor (spatial ability) increases the white-black differences on these tests. The same effect is seen for the two most spatially loaded subtests of the WISC-R — Object Assembly and Block Design.
The mean g loadings (g‾) and mean black-white differences in σ units (D‾), along with the conformance (r) of the subtests to Spearman’s hypothesis for the subtests of the WISC-R and K-ABC scales separately are as follows:
The K-ABC Achievement tests have nearly the same average g loading as the WISC-R and show the same overall white-black difference as the WISC-R subtests. The Achievement tests are the most deviant from the regression line (Figure 1). Tests that fall below the regression line are those that show a smaller white-black difference than is predicted by their g loading, and those above the regression line show larger differences than predicted.
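The notion of a subtest falling above or below the regression line can be made concrete with a short sketch. The loadings and differences below are invented for illustration and are not the values plotted in Figure 1; the sketch simply regresses the mean differences on the g loadings and reads the sign of each residual.

```python
import numpy as np

# Hypothetical values, NOT taken from Figure 1.
g_loadings = np.array([0.35, 0.42, 0.50, 0.58, 0.66, 0.71])
mean_diffs = np.array([0.20, 0.45, 0.40, 0.75, 0.60, 0.90])  # differences in sigma units

# Least-squares regression of the mean difference on the g loading.
slope, intercept = np.polyfit(g_loadings, mean_diffs, deg=1)
predicted = intercept + slope * g_loadings
residuals = mean_diffs - predicted

# Pearson r between loadings and differences (the index of conformity).
r = np.corrcoef(g_loadings, mean_diffs)[0, 1]

for g, d, e in zip(g_loadings, mean_diffs, residuals):
    position = "above" if e > 0 else "below"
    print(f"g = {g:.2f}  D = {d:.2f}  residual = {e:+.3f}  ({position} the line)")
```

A positive residual corresponds to a larger difference than the g loading predicts, a negative residual to a smaller one.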
These deviations from the regression line must mean that there are other factors besides g which add to or subtract from the size of the white-black difference in g. Hence, we must look at the correlation of the mean white-black σ differences (D‾) on each of the 24 subtests with the subtests’ loadings on each of the residualized first-order factors (combined groups). We have considered only the salient loadings which help to define each factor, as the nonsalient loadings and the differences between them are neither significantly nor appreciably greater than zero and could only add “noise” to the analysis. Table 5 shows the correlation of the g loading and of the salient loadings on each of the three first-order factors with the white-black differences. The correlation of factor loadings with D‾ is negligible only for the Verbal factor, although the g loadings of just the verbal tests are correlated +.60 with D‾ on the verbal tests. Loadings on the Spatial factor are highly positively correlated (+.845) with D‾, indicating that tests of spatial ability disfavor blacks relative to whites, on average. The Memory Span factor shows the opposite, with a negative correlation (-.825) with D‾.
The lower half of Table 5 shows other revealing correlations of various components of the total variance with D‾. (Square roots of these variance components are used here to make them directly comparable with factor loadings.) Note that the specificity, error variance, and the total non-g variance (1 – g²), are all negatively correlated with D‾. The total variance accounted for by the residualized first-order factors is not significantly correlated (r = +.057) with D‾. This necessarily means that tests that minimize g and maximize uniqueness will also minimize the white-black difference in test scores.
Is this the secret of the smaller white-black difference on the K-ABC Mental Processing (MP) IQ as compared with the WISC-R IQ? Table 6 reveals the answer. Note that the proportion of g variance in the KABC-MP scales is less than in the WISC-R (.191 vs. .277), and the specificity of the KABC-MP is larger than of the WISC-R (.419 vs. .290). The total non-g variance of the KABC-MP is .809 as compared with .723 for the WISC-R. From this fact alone, in accord with Spearman’s hypothesis, one would predict a smaller white-black difference on the K-ABC than on the WISC-R.
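The arithmetic behind these figures can be checked directly. The sketch below uses only the proportions quoted in the text; the "remainder" it prints (non-g variance minus specificity) is interpreted here, as an assumption, as the share left for the first-order factors plus error, since Table 6 itself is not reproduced in this section.

```python
# Proportions of variance as quoted in the text.
g_variance = {"KABC-MP": 0.191, "WISC-R": 0.277}
specificity = {"KABC-MP": 0.419, "WISC-R": 0.290}

# Total non-g variance is the complement of the g variance.
non_g_variance = {name: round(1.0 - v, 3) for name, v in g_variance.items()}

# Assumed interpretation: what remains after specificity is the
# variance due to first-order factors plus error.
remainder = {name: round(non_g_variance[name] - specificity[name], 3)
             for name in g_variance}

for name in g_variance:
    print(f"{name}: g = {g_variance[name]:.3f}, "
          f"non-g = {non_g_variance[name]:.3f}, "
          f"non-g minus specificity = {remainder[name]:.3f}")
```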
Finally, we computed factor scores based on the Schmid-Leiman factor analysis, and determined the white-black difference in σ units on each of the factors. These differences are as follows: g, +.77; Verbal, +.20; Spatial, +.39; Memory Span, +.01. In the national standardization sample of the WISC-R, the mean white-black difference on the Full Scale IQ is 1.14 σ, and the difference in g factor scores is the same (Jensen & Reynolds, 1982). When SES is partialled out of the WISC-R standardization data, however, the white-black Full Scale IQ difference is reduced to 0.80 σ (Reynolds & Gutkin, 1981), which is close to the white-black difference of +0.77 on g factor scores in the present white and black samples, which were matched on SES.
Thus, the smaller black-white difference on the K-ABC Mental Processing subtests than on the WISC-R subtests is due to two conditions: (1) the lower average g loadings of the K-ABC Mental Processing subtests (Mean g loading = .429) than of the WISC-R subtests (Mean g loading = .513); and (2) the negative relation between the white-black difference and the greater specificity of the Mental Processing subtests.
It is apparent that a test constructor could manipulate the size of the black-white difference to some extent by selecting or devising subtests that include other factors besides g (because g is unavoidable in any cognitive test) that favor either blacks or whites. In general, it appears that, when g is partialled out, tests that favor blacks over whites are those that are most loaded on a memory factor. Tests that favor whites over blacks are those most loaded (independently of g) on a spatial factor. The non-g factors favoring either whites or blacks are more evenly balanced in the battery of WISC-R subtests than in the K-ABC battery, which contains a preponderance (10 out of 13) of subtests which favor blacks on the non-g factors. If intelligence is conceived of as the general factor, or g, it would seem desirable that the black-white differences on other factors be balanced out or minimized as much as possible. If, however, g is not accepted as the construct criterion of intelligence, some explicit theoretical justification is required if intelligence (and consequently the magnitude of the black-white difference in IQ) is identified as some particular admixture of g with certain other factors that are uncorrelated with g and which, when summed algebraically, may favor either blacks or whites. The g factor is ubiquitous in all mental tests; the non-g factors are not. How can one justify any given mixture of the non-g factors in a battery of tests called an intelligence test if the total score or IQ derived from the battery reflects the non-g factors to such a degree that there is an appreciably less than perfect correlation between IQs and g factor scores on the test?
In conclusion, the lesser black-white difference on the K-ABC than on most other IQ tests is neither a mystery nor attributable to any superior psychometric features of the K-ABC. It is the predictable and inevitable result of two psychometric and statistical effects manifested in the K-ABC: (1) lower g loadings of the subtests, especially the Mental Processing subtests on which the IQ is based, and (2) an admixture of other factors besides g, primarily a memory span factor and test specificity, which, independently of g, favor blacks. Whether these conditions in a test are deemed desirable or undesirable depends on whether they enhance or degrade the test’s construct validity as a measure of general intelligence or its practical predictive validity for criteria involving intellectual achievement.
Because the g loadings derived from a particular battery of tests are not statistically independent of one another and do not qualify as a random sample from a population with an assumed normal distribution, and because the same is true of the standardized mean black-white differences on the tests, the Pearson product-moment coefficient of correlation (r) between g loadings and mean differences, although it is the most precise index of the degree of linear relationship between the two sets of variables, cannot, in a strict sense, be tested for statistical significance. Therefore, significance tests are not here applied to the Pearson r when used as an index of relationship between g loadings and mean black-white differences. However, in addition to the Pearson r, the corresponding Spearman rank-order correlation, ρ, is also reported, because its level of significance does not rest on any assumptions about the distributional characteristics of the two variates. As a nonparametric, or distribution-free, test of independence or index of relationship, the rank correlation’s level of significance is simply the proportion p of all possible n! permutations of the n ranked pairs of variables for which the absolute value of ρ is equal to or greater than the obtained ρ.
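The permutation logic described in this note can be sketched as follows. This is an illustrative implementation with hypothetical function names, assuming no tied values. It enumerates all n! permutations only when n is small; for larger n (24 subtests would give 24! permutations, far too many to enumerate), it falls back on a random sample of permutations, which is a substitution of this sketch and not something the note itself specifies.

```python
import itertools
import numpy as np

def ranks(x):
    """Ranks 1..n of the values in x (assumes no ties)."""
    x = np.asarray(x)
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r computed on the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

def permutation_p_value(x, y, max_exact_n=8, n_samples=20000, seed=0):
    """Proportion of permutations whose |rho| is >= the obtained |rho|."""
    rx, ry = ranks(x), ranks(y)
    observed = abs(np.corrcoef(rx, ry)[0, 1])
    n = len(rx)
    if n <= max_exact_n:
        # Exact test: every one of the n! orderings of the second variable.
        stats = [abs(np.corrcoef(rx, ry[list(p)])[0, 1])
                 for p in itertools.permutations(range(n))]
    else:
        # Monte Carlo approximation for larger n (an assumption of this sketch).
        rng = np.random.default_rng(seed)
        stats = [abs(np.corrcoef(rx, rng.permutation(ry))[0, 1])
                 for _ in range(n_samples)]
    # Small tolerance guards against floating-point rounding in the comparison.
    return float(np.mean(np.array(stats) >= observed - 1e-12))

# Usage with five perfectly ordered pairs: only the identical and the fully
# reversed orderings reach |rho| = 1, so the exact proportion is 2/120.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(spearman_rho(x, y), permutation_p_value(x, y))
```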
The black-white factor scores were obtained in essentially the same manner as the first principal component factor scores. That is, a subject’s factor score is a weighted average of the subtest scores. The age-standardized scores of all 172 subjects on each of the 24 WISC-R and K-ABC subtests were converted to z scores (mean = 0, σ = 1) based on the mean and σ of the age-standardized scores in the total sample. A subject’s factor score (on the black-white factor) is the weighted mean of the 24 subtest z scores, with each of the 24 subtest z scores weighted by the point-biserial correlation between the age-standardized subtest scores and the black-white dichotomy, quantized as 0 and 1, respectively.
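The weighted-mean procedure in this note can be sketched as below. The scores and weights are invented for illustration, the function names are hypothetical, and the use of the population σ (rather than the sample σ) for standardizing is an assumption that does not affect the resulting correlations. The sketch also checks a general algebraic identity: the correlation between two sets of weighted scores depends on the subtest intercorrelations as well as on the weights. This is one reason it need not equal the congruence coefficient, which depends on the weights alone.

```python
import numpy as np

def standardize(scores):
    """Convert each column to z scores (mean 0, sigma 1) within the sample."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)

def weighted_factor_scores(z, weights):
    """Each subject's factor score: weighted mean of that subject's z scores."""
    w = np.asarray(weights, dtype=float)
    return z @ w / w.sum()

# Hypothetical data: 4 subjects by 3 subtests (not from the study).
scores = np.array([[10.0, 12.0,  9.0],
                   [ 8.0, 11.0,  7.0],
                   [13.0,  9.0, 12.0],
                   [ 9.0,  8.0,  8.0]])
z = standardize(scores)

bw_weights = np.array([0.30, 0.10, 0.25])    # hypothetical point-biserial correlations
pc1_weights = np.array([0.70, 0.40, 0.65])   # hypothetical first-component loadings

bw_scores = weighted_factor_scores(z, bw_weights)
pc1_scores = weighted_factor_scores(z, pc1_weights)
r_scores = np.corrcoef(bw_scores, pc1_scores)[0, 1]

# Same correlation obtained from the weights and the subtest correlation matrix R.
R = np.corrcoef(z, rowvar=False)
r_formula = (bw_weights @ R @ pc1_weights) / np.sqrt(
    (bw_weights @ R @ bw_weights) * (pc1_weights @ R @ pc1_weights))

print(r_scores, r_formula)
```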