Explanation behind the non-g gains in the Flynn Effect : Introducing the measurement invariance model

The phenomenon known as secular IQ gains, or the Flynn Effect, is sometimes, perhaps even regularly, viewed as a reason to expect the disappearance of the black-white IQ gap. Not only is the logic behind this line of reasoning a pure non sequitur, but it is also disconcerting that the question of g (loadings) and measurement invariance (equivalence) seems to have been underdiscussed.

There are a lot of questions that need to be answered (Rodgers, 1999). To begin with, a cogent criticism of the Flynn Effect (FE) that seems to have gone unnoticed by most researchers is from Jensen (1998, pp. 331-332). He argued that if the gains were real, the later time point (younger cohort) would show underprediction of IQ on a wide variety of criterion measures, relative to the earlier time (older cohort). In other words, when IQ is held constant, recent cohorts would outperform old cohorts on, say, scholastic/achievement tests. This is how Jensen describes the situation :

A definitive test of Flynn’s hypothesis with respect to contemporary race differences in IQ is simply to compare the external validity of IQ in each racial group. The comparison must be based, not on the validity coefficient (i.e., the correlation between IQ scores and the criterion measure), but on the regression of the criterion measure (e.g., actual job performance) on the IQ scores. This method cannot, of course, be used to test the “reality” of the difference between the present and past generations. But if Flynn’s belief that the intergenerational gain in IQ scores is a purely psychometric effect that does not reflect a gain in functional ability, or g, is correct, we would predict that the external validity of the IQ scores, assessed by comparing the intercepts and regression coefficients from subject samples separated by a generation or more (but tested at the same age), would reveal that IQ is biased against subjects from the earlier generation. If the IQs had increased in the later generation without reflecting a corresponding increase in functional ability, the IQ would markedly underpredict the performance of the earlier generation – that is, their actual criterion performance would exceed the level of performance attained by those of the later generation who obtained the same IQ. The IQ scores would clearly be functioning differently in the two groups. This is the clearest indication of a biased test – in fact, the condition described here constitutes the very definition of predictive bias. If the test scores had the same meaning in both generations, then a given score (on average) should predict the same level of performance in both generations. If this is not the case (and it may well not be), the test is biased and does not permit valid comparisons of “real-life” ability levels across generations.

As Williams (2013) noted, “This assumes that the later test has not been renormed. In actual practice tests are periodically renormed so that the mean remains at 100. The result of this recentering is that the tests maintain their predictive validity, indicating that the FE gains are indeed hollow with respect to g”. But also because there is no predictive bias against blacks (i.e., underprediction), before or after correction for unreliability, Jensen concluded that the Flynn Effect (FE) and the B-W IQ gap must have different causes. Thus, Jensen seems to have anticipated most of the conclusions from the psychometric meta-analyses (PMA) of te Nijenhuis (2007, 2012, 2013) and the measurement invariance (MI) model of Wicherts et al. (2004). But the best illustration is provided by Ang et al. (2010) in a longitudinal study (CNLSY79) where it has been observed that racial groups did not differ in FE gain rates.
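Jensen's regression test can be illustrated with a small simulation. This is only a sketch under stated assumptions: the cohorts, the 10-point "hollow" inflation, the noise levels, and the helper names `fit_line` and `simulate_cohort` are all hypothetical and are not taken from Jensen or Williams. The criterion depends only on true ability, so an inflated IQ in the later cohort shows up as an intercept difference in the regression of the criterion on IQ, while the slopes stay about the same.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate_cohort(n, iq_inflation, rng):
    """Criterion depends only on true ability; observed IQ may carry a hollow bonus."""
    iq, criterion = [], []
    for _ in range(n):
        ability = rng.gauss(100, 15)
        iq.append(ability + iq_inflation + rng.gauss(0, 3))
        criterion.append(0.5 * ability + rng.gauss(0, 5))
    return iq, criterion

rng = random.Random(0)
early = simulate_cohort(5000, iq_inflation=0.0, rng=rng)
late = simulate_cohort(5000, iq_inflation=10.0, rng=rng)  # hypothetical hollow gain

b0_early, b1_early = fit_line(*early)
b0_late, b1_late = fit_line(*late)

# Predicted criterion performance at the same observed IQ in each cohort.
iq_score = 100
pred_early = b0_early + b1_early * iq_score
pred_late = b0_late + b1_late * iq_score
print(f"slopes: {b1_early:.3f} vs {b1_late:.3f}")
print(f"predicted criterion at IQ {iq_score}: early {pred_early:.2f}, late {pred_late:.2f}")
```

In this made-up setting, the earlier cohort outperforms the later cohort at the same observed IQ, which is the pattern Jensen describes as predictive bias against the earlier generation.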

Before this, Rushton (1999, 2010) had conducted a principal components analysis and was able to show that IQ gains have their highest loadings on a different component than inbreeding depression (a purely genetic effect), g-loadings, and black-white differences, which loaded on the same component (PC2), as shown below.

The rise and fall of the Flynn Effect as a reason to expect a narrowing of the Black-White IQ gap - Table 1

Table 1 presents the zero-order correlations in the top half of the matrix and the first-order partial correlations (after controlling for reliability) in the lower half of the matrix. As can be seen, inbreeding depression correlated significantly positively with the Black-White differences (r=0.48; P<0.05) but not with the gain scores (mean r=0.13; range=-0.07 to 0.29). Similarly, the g loadings correlated significantly positively with the Black-White differences (0.53, 0.69) but significantly negatively with the gain scores (mean r=-0.33; range=-0.04 to -0.73; P<0.00001, Fisher, 1970, pp. 99-101).

The rise and fall of the Flynn Effect as a reason to expect a narrowing of the Black-White IQ gap - Table 2
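The kind of computation summarized in these tables (zero-order correlations between subtest vectors, and first-order partial correlations controlling for reliability) can be sketched as follows. The subtest values are hypothetical and are NOT taken from Rushton's tables; the helper names `pearson` and `partial_corr` are illustrative. The partial correlation uses the standard first-order formula.

```python
import math

def pearson(x, y):
    """Zero-order Pearson correlation between two vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical values for six subtests (made up for illustration only).
g_loading   = [0.75, 0.70, 0.62, 0.55, 0.48, 0.40]
gain        = [0.10, 0.25, 0.20, 0.45, 0.35, 0.60]
reliability = [0.88, 0.85, 0.80, 0.82, 0.75, 0.72]

print("zero-order r(g, gain):", round(pearson(g_loading, gain), 3))
print("partial r(g, gain | reliability):",
      round(partial_corr(g_loading, gain, reliability), 3))
```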

Then, Flynn (2000, pp. 202-214) provided a counterweight to Rushton’s analysis. Instead of using a measure of crystallized g (here, the WISC), he created his own measure of fluid-g loadings by using Raven’s matrices, since “Jensen (1998, p 38) asserts that when the g-loadings of tests within a battery are unknown, the correlation of Raven’s with each test is often used; and Raven’s is the universally recognized measure of fluid g.” (p. 207). What he found (Table 3) was a high loading on PC1 for black-white differences, fluid g, and IQ gains, while inbreeding depression had its highest loading on a different component, PC2.

Unfortunately, Flynn’s result is somewhat questionable. He removed one subtest, Mazes, from the analysis, apparently without giving a reason. Further, as he mentioned, his computed fluid g did not correlate with WISC g: “the resulting fluid g hierarchy was compared to those that resulted when Rushton ranked WISC subtests by crystallized g, that is, the g-loadings derived from factor analysis of the WISC-R and WISC-III respectively. The hierarchies were uncorrelated: the values for rs and r ranged from negative 0.10 to positive 0.18.” (p. 207). Given that the Wechsler test is skewed towards crystallized g, and that fluid-g loading is just the opposite of crystallized-g loading in the Wechsler (Hu, Jul.5.2013), it is perhaps not surprising that he arrived at a different conclusion than Rushton. This point is important because it would probably be more informative to look at the g-loadings within a true fluid test, such as the Raven. In this case, both Rushton and Flynn were wrong. Moreover, Jensen (1998, pp. 90, 120) and Hu (Oct.5.2013) showed that the Raven correlated more with Wechsler’s crystallized tests than fluid tests, meaning that the rank ordering of (subtest) fluid g should mirror (subtest) crystallized g, which is not what we see in Flynn’s figures. Also, Must (2003, p. 470), who found no Jensen effect behind the secular gain in Estonia, adds further comment on Flynn.

Leaving this aside, Flynn should have noted that those fluid gains were seemingly the product of test-retest effects and familiarity. As Jensen (1998) mentioned, “These tendencies increase the chances that one or two multiple-choice items, on average, could be gotten “right” more or less by sheer luck. Just one additional “right” answer on the Raven adds nearly three IQ points” (p. 323). This point has been made recently by Alan S. Kaufman (2010a, p. 394; 2010b, pp. 498-502) contra James Flynn (2010b, pp. 413-425).

The item type used in Similarities resembles the age-old questions that teachers have asked children in schools for generations. In contrast, matrices-type items were totally unknown to children or adults of yesteryear and remained pretty atypical for years. Over time, however, this item type has become more familiar to people around the world, especially as tests of this sort have been increasingly used for nonbiased assessment, including for the identification of gifted individuals from ethnic minorities. And, because Raven’s tests can be administered by nonpsychologists, these items tend to be more accessible to the public than are items on Wechsler’s scales, which are closely guarded because of the clinical training that is a requisite for qualified examiners. But go to any major bookstore chain, or visit popular websites, and you can easily find entire puzzle books or pages of abstract matrix analogies.

It is, therefore, difficult to evaluate gains on matrices tasks without correcting these gains for time-of-measurement effects. The power of this “time lag” variable was demonstrated by Owens in his groundbreaking longitudinal study of aging and intelligence. Owens (1953) administered the Army Alpha test in 1950 to 127 men, age 50, who had been administered the same test in 1919 at age 19, when they were freshmen at Iowa State University (initial N = 363). The study continued in 1961 when 96 of these men were tested again, at age 61 (Owens, 1966).

The 96 men tested three times improved in verbal ability between ages 19 and 50 followed by a slight decline from age 50 to 61. On nonverbal reasoning ability, they displayed small increments from one test to the next. However, Owens had the insight to also test a random sample of 19-year-old Iowa State freshmen on the Army Alpha in 1961 to 1962 to permit a time-lag comparison. He was able to use the data from the 19-year-olds to estimate the impact of cultural change on the obtained test scores. When Owens corrected the data for cultural change, the Verbal scores continued to show gains between ages 19 and 61; but what had appeared to be small increments in Reasoning were actually steady decreases in performance.

The time-lag correction may reflect real differences in mental ability (i.e., FE) as well as changes in test-taking ability and familiarity with a particular kind of task. The mere fact of large gains on a test such as Raven’s matrices over several generations, in and of itself, cannot be interpreted unequivocally as an increase in abstract reasoning ability without proper experimental controls. When Flynn has interpreted gain scores for groups of individuals tested generations apart on the identical Raven’s matrices items (e.g., Flynn, 1999, 2009a), he has not controlled for time-of-measurement effects.
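The logic of the time-lag correction described in the passage above can be written as a one-line adjustment. This is a minimal sketch, assuming that the correction amounts to subtracting the cohort (cultural) shift from the observed within-person change; the numbers are hypothetical score units, not Owens's data, and the function name is illustrative.

```python
def age_change_corrected(longitudinal_change, time_lag_change):
    """Subtract the cohort/cultural shift from the observed within-person change."""
    return longitudinal_change - time_lag_change

# Hypothetical numbers in arbitrary score units (not Owens's data).
observed_reasoning_change = 0.5   # same men, first testing vs. later testing
cultural_change = 2.0             # new 19-year-olds vs. 19-year-olds at the first testing

corrected = age_change_corrected(observed_reasoning_change, cultural_change)
print(corrected)  # -1.5: an apparent small increment becomes a decrease
```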

Raven scores increase so much because people today have indeed been much more exposed to visual media and other visual experiences related to the modernization of societies. Such a phenomenon improves scores through test-content familiarity. It is not surprising, then, that Raven’s gain scores show measurement bias as well, as stated by Fox & Mitchum (2012). In sum, Raven’s gains exhibit DIF (differential item functioning, i.e., bias) in the direction of over-estimation of scores in more recent cohorts. At equal total raw score, older cohorts would infer a greater number of rules. An example of what counts as a rule is shown below :

Raven's Advanced Progressive Matrices

The correct answer is 5. The variations of the entries in the rows and columns of this problem can be explained by 3 rules.

1. Each row contains 3 shapes (triangle, square, diamond).
2. Each row has 3 bars (black, striped, clear).
3. The orientation of each bar is the same within a row, but varies from row to row (vertical, horizontal, diagonal).

From these 3 rules, the answer can be inferred (5).

Because there are 3 rules, the correct response must contain 3 correct objects. The authors classify the response categories as follows : 1 for no correct objects, 2 for one correct object, 3 for two correct objects, 4 for three correct objects. We see from their Figure 10 that, at any given Raven’s raw score, respondents having higher response categories are more likely to be members of older cohorts. Because complexity in the Raven is a function of the number and type of rules (Carpenter et al., 1990; Primi, 2001), with more complex items involving more rules and more complex rules, it can be inferred that the apparently environmental or cultural effect underlying the Flynn effect is not g-loaded. In any case, the very fact that Piaget tests, another set of highly culture-free tests, display a large secular IQ decline (Shayer et al., 2007) casts doubt on the argument that FE gains on culture-free tests (e.g., Raven) support the view that g has increased. Even Raven’s formidable gain is not observed everywhere. For instance, Raven’s gain was totally absent in Australia between 1975 and 2003 (Williams, 2013, pp. 2-3). Similarly, Draw-a-Person and Raven CPM show no gain in Brazil between 1980 and 2000 (Bandeira et al., 2012) despite increases in nutrition and general health indices.

Changes in the meaning of IQ tests such as the WISC are also apparent. Kaufman thus noted that changes in procedures and instructions have probably distorted the meaning of scores on tests that were alike in name only. He points, for instance, to a “scoring system that encourages, rather than discourages, querying children’s incomplete or ambiguous responses”. Regarding Raven-like item gains, Kaufman (2010b) reiterates :

I am not talking about practice effects, the kind of IQ gains that occur over an interval of weeks or months simply because of the experience of having taken the same test before. Rather, I am talking about a cohort effect, one that affects virtually everyone who is growing up during a specific era. In the 1930s, matrices tests were largely unknown and children or adults who would have been administered such tests would have found them wholly unfamiliar. A whole society would have performed relatively poorly on such test items because of their unusualness. By the 1950s, such tests would have been known by some, not many, and by the 1990s and 2000s, matrices tests and similar item styles proliferate and are accessible to everyone. Therefore, it is feasible that people would score higher on a Raven test from one generation to the next simply because the construct measured by the test would have been a bit different from one decade to the next. Such time-of-measurement or time lag cohort effects exert powerful influences in cross-sectional and longitudinal studies of IQ and aging (Kaufman, 2001b; Owens, 1966) and must be controlled when evaluating true changes in ability between early adulthood and old age.

These time lag effects include both instrumentation and real FE gains in IQ. It is the instrumentation aspect of cohort effects that needs to be controlled in FE studies to determine which aspect of the gain is “real” and which aspect concerns the familiarity of the test.

Needless to say, all this confirms the findings of a lack of measurement invariance. Applying MGCFA (multi-group confirmatory factor analysis) to several data sets, Wicherts et al. (2004, pp. 529-532, and pp. 512-513 for a summary of models) came to that conclusion. The cognitive differences between cohorts are not comparable, that is, they do not reflect a difference in common factors, due to a lack of measurement invariance. An observed score is not measurement invariant when two persons having the same latent ability (i.e., being identical on the construct(s) being measured) have different probabilities of attaining the same score on the test (i.e., they have different expected test scores). In the case of measurement invariance, we expect that these two persons would have the same expected item score. But in the case of no measurement invariance, we expect to see systematic differences in scores on the biased item. When measurement bias is detected, the observed scores depend, at least partially, on group membership (cohort, race/ethnicity, gender…). Note that measurement invariance can be investigated at the subscale level or item level (p. 530). Concretely, this would translate into intercept differences (between groups), which imply uniform bias with respect to groups. Obviously, higher test sophistication or different test-taking strategies are such sources of bias, resulting in inflated scores in one group but not the other. As Brand (1987) pointed out earlier, “it is perhaps not surprising if they now record gains as education takes a less meticulous form in which speed and intelligent guessing receive encouragement in the classroom”. Wicherts et al. then explain the consequences in these terms :

Conversely, if factorial invariance is untenable, the between-group differences cannot be interpreted in terms of differences in the latent factors supposed to underlie the scores within a group or cohort. This implies that the intelligence test does not measure the same constructs in the two cohorts, or stated otherwise, that the test is biased with respect to cohort. If factorial invariance is not tenable, this does not necessarily mean that all the constituent IQ subtests are biased.

Intercept differences would mean that, for two persons equated on latent abilities (e.g., mathematical or verbal abilities) but belonging to different groups, the expected scores on one (or more) subtest(s) still differ. As a result, the standardized group difference on such a subtest is no longer proportional to its factor loading on the latent factor, whereas under measurement invariance “mean group differences on the subtests should be collinear with the corresponding factor loading” (Wicherts & Dolan, 2010, Figure 3).
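A small numerical sketch may help make the intercept argument concrete. It assumes a linear factor model in which the expected subtest score is an intercept plus a loading times the latent ability; the symbols (nu, lambda, eta), the three subtests, and all parameter values are hypothetical and chosen only for illustration.

```python
def expected_score(intercept, loading, latent):
    """Expected subtest score under a linear factor model: nu + lambda * eta."""
    return intercept + loading * latent

# Hypothetical parameters for three subtests; only the second subtest
# has a cohort-specific intercept (uniform bias).
loadings = [0.8, 0.6, 0.7]
intercepts_old = [10.0, 10.0, 10.0]
intercepts_new = [10.0, 11.5, 10.0]

# Two persons with the same latent ability, one from each cohort.
eta = 0.0
same_ability_gap = [
    expected_score(nu_new, lam, eta) - expected_score(nu_old, lam, eta)
    for nu_old, nu_new, lam in zip(intercepts_old, intercepts_new, loadings)
]

# Observed mean differences when the newer cohort's latent mean is 0.5 higher.
latent_gain = 0.5
mean_diff = [
    (nu_new - nu_old) + lam * latent_gain
    for nu_old, nu_new, lam in zip(intercepts_old, intercepts_new, loadings)
]

# Under invariance, mean_diff / loading would be the same for every subtest.
ratios = [d / lam for d, lam in zip(mean_diff, loadings)]
print(same_ability_gap)
print([round(r, 2) for r in ratios])
```

In this toy setting, the unbiased subtests give a ratio equal to the latent gain, while the biased subtest breaks the proportionality between mean difference and loading.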

While it has been found that measurement invariance holds when it comes to black-white differences (Dolan, 2000; Dolan & Hamaker, 2001; Lubke et al., 2003), the FE gains show measurement bias. Now, to better understand what a violation of measurement invariance implies, consider the following example from Mingroni (2007, p. 812), stated in simple terms :

For example, Wicherts (personal communication, May 15, 2006) cited the case of a specific vocabulary test item, terminate, which became much easier over time relative to other items, causing measurement invariance to be less tenable between cohorts. The likely reason for this was that a popular movie, The Terminator, came out between the times when the two cohorts took the test. Because exposure to popular movie titles represents an aspect of the environment that should have a large nonshared component, one would expect that gains caused by this type of effect should show up within families. Although it might be difficult to find a data set suitable for the purpose, it would be interesting to try to identify specific test items that display Flynn effects within families. Such changes cannot be due to genetic factors like heterosis, and so a heterosis hypothesis would initially predict that measurement invariance should become more tenable after removal of items that display within-family trends. One could also look for items in which the heritability markedly increases or decreases over time. In the particular case cited above, one would also expect a breakdown in the heritability of the test item, as evidenced, for example, by a change in the probability of an individual answering correctly given his or her parents’ responses.

Because of such (cultural) influences, the older cohorts will be disadvantaged on some items or subtests. Obviously, the (inflated) score of the younger cohort does not reflect increases in latent ability (we speak of a latent trait because it is not directly measurable). In some instances, factorial invariance could be considered as a test of cultural bias. Concretely, if the models with equality constraints on the measurement intercepts show poor fit, the conclusion would be that differences in observed scores are not differences in latent scores. Models 4a and 4b, in the figure below, stand for strict factorial invariance and strong factorial invariance, respectively. The findings of Wicherts et al. (2004) can be summarized as follows :

… The results of the MGCFAs indicated that the present intelligence tests are not factorially invariant with respect to cohort. This implies that the gains in intelligence test scores are not simply manifestations of increases in the constructs that the tests purport to measure (i.e., the common factors). Generally, we found that the introduction of equal intercept terms (N1=N2; Models 4a and 4b; see Table 1) resulted in appreciable decreases in goodness of fit. This is interpreted to mean that the intelligence tests display uniform measurement bias (e.g., Mellenbergh, 1989) with respect to cohort. The content of the subtests, which display uniform bias, differs from test to test. On most biased subtests, the scores in the recent cohort exceeded those expected on basis of the common factor means. This means that increases on these subtests were too large to be accounted for by common factor gains. This applies to the Similarities and Comprehension subtests of the WAIS, the Geometric Figures Test of the BPP, and the Learning Names subtest of the RAKIT. However, some subtests showed bias in the opposite direction, with lower scores in the second cohorts than would be expected from common factor means. This applies to the DAT subtests Arithmetic and Vocabulary, the Discs subtest of the RAKIT, and several subtests of the Estonian NIT. Although some of these subtests rely heavily on learned content (e.g., Information subtest), the Discs subtest does not.

Once we accommodated the biased subtests, we found that in four of the five studies, the partial factorial invariance models fitted reasonably well. The common factor mean differences between cohorts in these four analyses were quite diverse. In the WAIS, all common factors displayed an increase in mean. In the RAKIT, it was the nonverbal factor that showed gain. In the DAT, the verbal common factor displayed the greatest gain. However, the verbal factor of the RAKIT and the abstract factor of the DAT showed no clear gains. In the BPP, the single common factor, which presumably would be called a (possibly poor) measure of g, showed some gain. Also in the second-order factor model fit to the WAIS, the second-order factor (again, presumably a measure of g) showed gains. However, in this model, results indicated that the first-order perceptual organization factor also contributed to the mean differences. …

Generally speaking, there are a number of psychometric tools that may be used to distinguish true latent differences from bias. It is notable that with the exception of Flieller (1988), little effort has been spent to establish measurement invariance (or bias) using appropriate statistical modeling. The issue whether the Flynn effect is caused by measurement artifacts (e.g., Brand, 1987; Rodgers, 1998) or by cultural bias (e.g., Greenfield, 1998) may be addressed using methods that can detect measurement bias and with which it is possible to test specific hypothesis from a modeling perspective. Consider the famous Brand hypothesis (Brand, 1987; Brand et al., 1989) that test-taking strategies have affected scores on intelligence tests. Suppose that participants nowadays more readily resort to guessing than participants in earlier times did, and that this strategy results in higher scores on multiple-choice tests. A three-parameter logistic model that describes item responses is perfectly capable of investigating this hypothesis because this model has a guessing parameter (i.e., lower asymptote in the item response function) that is meant to accommodate guessing. Changes in this guessing parameter due to evolving test-taking strategies would lead to the rejection of measurement invariance between cohorts. Currently available statistical modeling is perfectly capable of testing such hypotheses.
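As an illustration of the three-parameter logistic model mentioned in the passage above, the item response function can be sketched as follows. The parameter values are hypothetical; `a` is the discrimination, `b` the difficulty, and `c` the guessing parameter (lower asymptote). The sketch only shows how a higher guessing parameter raises the probability of a correct answer for a person of the same ability, which is the kind of cohort difference that would violate measurement invariance.

```python
import math

def p_correct(theta, a, b, c):
    """3PL item response function: guessing floor c plus a scaled logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Same item and same latent ability, two hypothetical guessing parameters.
theta = -1.0
p_old = p_correct(theta, a=1.2, b=0.0, c=0.00)  # cohort that rarely guesses
p_new = p_correct(theta, a=1.2, b=0.0, c=0.20)  # cohort that guesses more readily

print(f"P(correct) without guessing: {p_old:.3f}")
print(f"P(correct) with guessing:    {p_new:.3f}")
```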

… Here, we use results from Dolan (2000) and Dolan and Hamaker (2001), who investigated the nature of racial differences on the WISC-R and the K-ABC scales. We standardized the AIC values of Models 1 to 4a within each of the seven data sets to compare the results of the tests of factorial invariance on the Flynn effects and the racial groups. These standardized AIC values are reported in Fig. 2.

Are intelligence tests measurement invariant over time - Investigating the nature of the Flynn effect - Figure 2

As can be seen, the relative AIC values of the five Flynn comparisons show a strikingly similar pattern. In these cohort comparisons, Models 1 and 2 have approximately similar standardized AICs, which indicates that the equality of factor loadings is generally tenable. A small increase is seen in the third step, which indicates that residual variances are not always equal over cohorts. However, a large increase in AICs is seen in the step to Model 4a, the model in which measurement intercepts are cohort invariant (i.e., the strict factorial invariance model). The two lines representing the standardized AICs from both B–W studies clearly do not fit this pattern. More importantly, in both B–W studies, it is concluded that the measurement invariance between Blacks and Whites is tenable because the lowest AIC values are found with the factorial invariance models (Dolan, 2000; Dolan & Hamaker, 2001). This clearly contrasts with our current findings on the Flynn effect. It appears therefore that the nature of the Flynn effect is qualitatively different from the nature of B–W differences in the United States. Each comparison of groups should be investigated separately. IQ gaps between cohorts do not teach us anything about IQ gaps between contemporary groups, except that each IQ gap should not be confused with real (i.e., latent) differences in intelligence. Only after a proper analysis of measurement invariance of these IQ gaps is conducted can anything be concluded concerning true differences between groups.

Whereas implications of the Flynn effect for B–W differences appear small, the implications for intelligence testing, in general, are large. That is, the Flynn effect implies that test norms become obsolete quite quickly (Flynn, 1987). More importantly, however, the rejection of factorial invariance within a time period of only a decade implies that even subtest score interpretations become obsolete. Differential gains resulting in measurement bias, for example, imply that an overall test score (i.e., IQ) changes in composition.
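The within-data-set standardization of AIC values described in the quoted passage can be sketched as a simple z-score across the nested models. The AIC values below are made up for illustration and are NOT taken from Wicherts et al.; whether the authors used the population or the sample standard deviation is not stated here, so the use of `pstdev` is an assumption.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a list of AIC values within one data set."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical AIC values for Models 1, 2, 3 and 4a (made up).
aic_by_dataset = {
    "cohort comparison (made up)": [1500.0, 1504.0, 1520.0, 1700.0],
    "group comparison (made up)":  [900.0, 895.0, 890.0, 888.0],
}

z = {name: standardize(values) for name, values in aic_by_dataset.items()}
for name, scores in z.items():
    print(name, [round(s, 2) for s in scores])
```

In the first made-up series the jump occurs at the step that constrains intercepts, while in the second the most constrained model has the lowest AIC, mirroring the two patterns the passage contrasts.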

Also using MGCFA, on the ACT and EXPLORE (two cognitive tests), Wai & Putallaz (2011) likewise reject factorial invariance for the secular gains, in a very large sample.

For example, for tests that are most g loaded such as the SAT, ACT, and EXPLORE composites, the gains should be lower than on individual subtests such as the SAT-M, ACT-M, and EXPLORE-M. This is precisely the pattern we have found within each set of measures and this suggests that the gain is likely not due as much to genuine increases in g, but perhaps is more likely on the specific knowledge content of the measures. Additionally, following Wicherts et al. (2004), we used multigroup confirmatory factor analysis (MGCFA) to further investigate whether the gains on the ACT and EXPLORE (the two measures with enough subtests for this analysis) were due to g or to other factors. 4

4. … Under this model the g gain on the ACT was estimated at 0.078 of the time 1 SD. This result was highly sensitive to model assumptions. Models that allowed g loadings and intercepts for math to change resulted in Flynn effect estimates ranging from zero to 0.30 of the time 1 SD. Models where the math intercept was allowed to change resulted in no gains on g. This indicates that g gain estimates are unreliable and depend heavily on assumptions about measurement invariance. However, all models tested consistently showed an ACT g variance increase of 30 to 40%. Flynn effect gains appeared more robust on the EXPLORE, with all model variations showing a g gain of at least 30% of the time 1 SD. The full scalar invariance model estimated a gain of 30% but showed poor fit. Freeing intercepts on reading and English as well as their residual covariance resulted in a model with very good fit: χ² (7) = 3024, RMSEA = 0.086, CFI = 0.985, BIC = 2,310,919, SRMR = 0.037. Estimates for g gains were quite large under this partial invariance model (50% of the time 1 SD). Contrary to the results from the ACT, all the EXPLORE models found a decrease in g variance of about 30%. This demonstrates that both the ACT and EXPLORE are not factorially invariant with respect to cohort … gains may still be due to g in part but due to the lack of full measurement invariance, exact estimates of changes in the g distribution depend heavily on complex partial measurement invariance assumptions that are difficult to test. Overall the EXPLORE showed stronger evidence of potential g gains than did the ACT.

Using the same technique again, Must et al. (2009) found a lack of invariance in the Estonian Flynn Effect when comparing three cohorts (1933/36, 1997/98, and 2006). Equality of intercepts once again resulted in poor fit. Their results :

Clearly the g models of 1933/36 and 2006 differ by regression intercepts (Table 7). In all three comparisons the subtests A5 (Symbol–Number) and B5 (Comparisons) have different intercepts. In two comparisons from three subtests A1 (Arithmetic), B1 (Computation), B2 (Information), and B3 (Vocabulary) regression intercepts were not invariant. It is evident that in 2006 the subtest A5 and B5 do not have the same meaning they had in 1933/36. The comparison of the cohorts on the bases of those subtests will give “hollow” results. The conclusions about gains based on the subtest A1, B1, B2, and B3 should also be made with caution.

In the initial stage (model 4), models testing the equality of intercepts yielded bad fit estimations. Table 7 shows, for instance, that comparing data from 1933/36 and 2006 using data from older children yielded values of RMSEA=.129, and CFI=.865. Thus, it can be concluded, that when comparing the data from 1933/36 and 2006 there are some minimal differences in factor loadings, but the main and significant differences are in regression intercepts. This means, first of all, that students at the same level of general mental ability (g) from different cohorts have different manifest test scores: g has different impact on the performance of students in different subtests in different cohorts making some subtests clearly easier for later cohorts.

Their discussion is worth quoting in full, because it gives a better grasp of the complexity underlying the secular IQ gains :

Six NIT subtests have clearly different meaning in different periods. The fact that the subtest Information (B2) has got more difficult may signal the transition from a rural to an urban society. Agriculture, rural life, historical events and technical problems were common in the 1930s, such as items about the breed of cows or possibilities of using spiral springs, whereas at the beginning of the 21st century students have little systematic knowledge of pre-industrial society. The fact that tasks of finding synonyms–antonyms to words (A4) is easier in 2006 than in the 1930s may result from the fact that the modern mind sees new choices and alternatives in language and verbal expression. More clearly the influence of language changes was revealed in several problems related to fulfilling subtest A4 (Synonyms–Antonyms). In several cases contemporary people see more than one correct answer concerning content and similarities or differences between concepts. It is important that in his monograph Tork (1940) did not mention any problems with understanding the items. It seems that language and word connotations have changed over time.

The sharp improvement in employing symbol–number correspondence (A5) and symbol comparisons (B5) may signal the coming of the computer game era. The worse results in manual calculation (B1) may be the reflection of calculators coming in everyday use.

From this, they conclude : “With lack of invariance of the g factor, overall statements about Flynn Effects on general intelligence are unjustified”. Besides, their Table 4 is particularly enlightening. The mean intercorrelation of NIT subtests shows a stark decline from the 1933 cohort to the 1997 or 2006 cohort. Furthermore, the difference in g-factor loadings between the 1933 and 1997 (or 2006) cohorts, as assessed using an F-statistic, is significant. Both of these strongly suggest a decline in g, and thus the hollowness of FE gains in Estonia, even if the FE gains continue to show a positive trend when looking at the IRT scores (Shiu & Beaujean, 2013, Table 3).

Later on, Must (2013, pp. 7-9) also found that changes in test-taking strategies contributed to the Flynn Effect in Estonia. The decline in missing answers was accompanied by a rise in both wrong and correct answers. This suggests that guessing has come to play a larger role in test completion over time.

In the period 1933/36-2006 mean subtest results of comparable age-cohorts have changed (Table 2). There is a general pattern that the frequency of missing answers in NIT subtests is diminished (approximately 1 d), with the exception of the subtest B1 (Computation), where the rise in missing answers was 0.36 d. The rise of right answers is evident in most of the subtests (7 from 9). The mean rise of right answers per subtest is about .86 d. The frequency of wrong answers rose as well. The mean rise effect of wrong answers (.30 d) is smaller than the mean rise in right answers, but it is also evident in 7 of the 9 subtests. In the FE framework it is important to note that the diminishing number of missing answers is offset by, not only right answers, but wrong answers as well.

Over time the general relationship between right, wrong and missing answers has changed.

One of the clearest findings in both cohorts is that instead of right answer there are missing answers. This correlation between the number of correct answers and missing answers was more apparent in 1933/36 (r = -.959, p < .001) than in 2006 (r = -.872, p < .001). In 1933/36 the number of wrong answers did not correlate with the number of right answers (r = -.005), but in 2006 the frequency of wrong answers moderately indicates a low number of right answers (r = -.367, p < .001). In both cohorts the number of missing answers is negatively correlated with wrong answers, but the relationship is stronger in the 2006 cohort (r = -.277, p < .001; in 1933/36 r = -.086, p = .01). The cohort differences between the above presented correlations across cohorts are statistically significant.

Besides, Must (2013, Figure 1) also shows that after adjustment for wrong answers, the effect size is reduced across subtests, although to different degrees. His Table 4 displays the changes in the odds of giving a right, wrong or missing answer at the item level. The odds of both right and wrong answers were rising. The change in response structure is further evidenced (Table 5) by the correlation between wrong answers and the order of items (i.e., the odds of giving a wrong answer toward the end of a subtest). This relationship displays no clear pattern for the 1933/36 cohort (mean rho = 0.086) while it is positive for the 2006 cohort (mean rho = 0.426). As Must mentioned, “The 2006 cohort tried to solve more items, but in a more haphazard way than did the cohort of 1933/36. The main difference between cohorts is the test-taking speed. But speed has its price – the more items that students tried to solve, the higher the probability of answering incorrectly as well (Table 5)”. The subtests are organized in order of increasing difficulty, and so the attempt to answer more items results in more errors “especially so if the items required attention and thought or the test-takers are hurrying towards the end of test”. Needless to say, all this is wholly consistent with the fact that IQ scores are not measurement invariant across cohorts, unlike black-white differences.

Further evidence of a violation of measurement invariance comes from Beaujean (2008), who used an Item Response Theory (IRT) model to specify how individual latent abilities and item properties (a : discrimination; b : difficulty; c : guessing the correct answer) relate to how a subject responds to an item or set of items. This makes it possible to tell whether a score difference reflects rising ability, changing item properties, or an interaction of the two. Differential item functioning (DIF) occurs when those item parameters differ between groups (race, gender, cohort). Such is the case when, for instance, two persons from different groups with equal ability have different probabilities of answering the item correctly.
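
To make the idea of DIF concrete, here is a minimal sketch of the three-parameter logistic (3PL) model, using the conventional labeling in which a is discrimination, b is difficulty and c is the guessing parameter. The parameter values are hypothetical and chosen only for illustration; they are not estimates from Beaujean or any other study.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response.

    theta: latent ability, a: discrimination, b: difficulty,
    c: pseudo-guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters for the SAME item in two cohorts (not real estimates).
item_early_cohort = {"a": 1.2, "b": 0.0, "c": 0.2}
item_late_cohort = {"a": 1.2, "b": -0.5, "c": 0.2}  # item became easier

theta = 0.0  # two test-takers with identical latent ability
p_early = p_correct(theta, **item_early_cohort)
p_late = p_correct(theta, **item_late_cohort)

# A non-zero gap at equal ability is what DIF means.
dif_gap = p_late - p_early
print(f"early: {p_early:.3f}, late: {p_late:.3f}, gap: {dif_gap:.3f}")
```

In this sketch the later cohort gets a higher probability of success on the item without any difference in ability, which is the situation an IRT-based comparison is designed to detect and remove.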

Historically, IRT was known as latent trait theory; Jensen usually discussed it in terms of the Item Characteristic Curve (ICC). Jensen (1980, p. 443 and following) describes such analyses : “If the test scores measure a single ability throughout their full range, and if every item in the test measures this same ability, then we should expect that the probability of passing any single item in the test will be a simple increasing monotonic function of ability, as indicated by the total raw score on the test.” (p. 442). No measurement bias is detected if item equivalence holds, that is, if the parameters (a, b, c) of the ICC are invariant across groups. Here is Jensen’s explanation of the concept :

Hence, a reasonable statistical criterion for detecting a biased item is to test the null hypothesis of no difference between the ICCs of the major and minor groups. In test construction, the items that show a significant group difference in ICCs should be eliminated and new ICCs plotted for all the remaining items, based on the total raw scores after the biased items have been eliminated. The procedure can be reiterated until all the biased items have been eliminated. The essential rationale of this ICC criterion of item bias is that any persons showing the same ability as measured by the whole test should have the same probability of passing any given item that measures that ability, regardless of the person’s race, social class, sex, or any other background characteristics. In other words, the same proportions of persons from each group should pass any given item of the test, provided that the persons all earned the same total score on the test. In comparing the ICCs of groups that differ in overall mean score on the test, it is more accurate to plot the proportion of each group passing the item as a function of estimated true scores within each group (rather than raw scores on the test), to minimize group differences in the ICCs due solely to errors of measurement.

In other words, an item that displays DIF is removed. All the remaining items not showing DIF are used to compare (latent) scores between groups. When IRT scores are used, as shown in Table 3, the FE gains are reduced to almost nothing (0.06 SD over 14 years).

Using Item Response Theory to assess the Flynn Effect in the National Longitudinal Study of Youth 79 Children and Young Adults data - Table 3

The results from the PPVT-R analysis are shown in Table 2, with the columns labeled IRT being the derived IRT latent trait scores. As with the PIAT-Math scores, Cohen’s (1988) d (with a pooled standard deviation) was calculated for all score types to facilitate comparison (see Table 3). Like the PIAT-Math, the raw, standardized, and percentile scores show an increase over time of the magnitude of .13, .41, and .48 standard deviations, but the IRT scores show a negligible increase over time of the magnitude of .06. This pattern is generally repeated when the data are grouped by age, when the n is of appreciable size.
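
For readers unfamiliar with the effect size used in that passage, the following is a minimal sketch of Cohen’s d with a pooled standard deviation. The two score lists are made up for illustration and have nothing to do with the NLSY data.

```python
import math
from statistics import mean, variance

def cohens_d(later, earlier):
    """Standardized mean difference (later minus earlier), pooled SD."""
    n1, n2 = len(later), len(earlier)
    pooled_var = ((n1 - 1) * variance(later) + (n2 - 1) * variance(earlier)) / (n1 + n2 - 2)
    return (mean(later) - mean(earlier)) / math.sqrt(pooled_var)

# Hypothetical scores for two cohorts tested at the same age.
earlier_scores = [95, 100, 105, 98, 102]
later_scores = [97, 102, 107, 100, 104]

d = cohens_d(later_scores, earlier_scores)
print(f"d = {d:.2f}")
```

The same function can be applied to raw, standardized or IRT-derived scores, which is what makes the comparison across score types in Table 3 possible.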

Beaujean (2010, Figure 1) also used an IRT model to assess the FE gains on the Wordsum (vocabulary) test in the GSS. With IRT scores, the changes in (verbal) IQ over time are smaller. This again shows the importance of considering item bias when comparing IQs between groups.

Rejection of measurement equivalence makes it difficult to draw inferences from any g-based theory (e.g., Spearman’s hypothesis) because it means that score differences are not comparable. According to Kanaya & Ceci (2010), children tested on an old version (norm) of the Wechsler (WISC-R) and retested on a newer, supposedly more difficult version (WISC-III) showed a significant IQ decline compared to the group tested and retested on the WISC-R. The group tested and retested on the newer WISC-III did not differ much from the group tested and retested on the older WISC-R (Table 3) after controlling for age, initial IQ and practice effect (i.e., time between tests).

Actually, the empirical literature on test bias is consistent with the idea that real intelligence is not rising over time (see Gottfredson’s comments; 2007, 2008, p. 560). The correlations of subtest g-loadings with subtest gains show that the Flynn Effect is hollow with respect to g. Several studies using Jensen’s method of correlated vectors (MCV) have investigated the issue, sometimes yielding a positive correlation (r) between g-loadings (g) and secular gains (d) and sometimes a negative correlation or none at all. Dolan (2000; & Hamaker 2001) and others (Ashton, 2005) have criticized MCV. Jan te Nijenhuis (2007, pp. 287-288) countered that psychometric meta-analytic (PMA) methods, which correct for artifacts (e.g., sampling error, restriction of range of g loadings, reliability of the vector of score gains and of the vector of g loadings, deviation from perfect construct validity), could improve Jensen’s MCV and yield more accurate results. But keep in mind that MCV+PMA cannot test for possible bias in cognitive tests. MGCFA and IRT are more suitable for that purpose. In any case, the uncorrected (r) and corrected (ρ) correlations found by te Nijenhuis (2007; see also, 2012, 2013) were respectively -0.81 and -0.95, as shown in Table 2. The fact that those artifacts explained 99% of the variance in effect sizes means that other plausible moderators, such as sample age, IQ-sample, test-retest interval or test type, play no role.
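
At its core, MCV is nothing more than a correlation between two vectors, one of subtest g-loadings and one of subtest score gains. The sketch below shows that bare computation only, without any of the PMA corrections. The five subtests and their values are hypothetical, invented to show what a negative MCV result looks like.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical subtests: g-loading and secular gain (in d units) for each.
g_loadings = [0.80, 0.70, 0.60, 0.50, 0.40]
score_gains_d = [0.10, 0.20, 0.25, 0.40, 0.50]

mcv_r = pearson_r(g_loadings, score_gains_d)
print(f"MCV correlation: {mcv_r:.2f}")
```

With so few data points (one per subtest), a single MCV correlation is very unstable, which is one reason why a meta-analytic treatment across many batteries is more informative than any individual study.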

Score gains on g-loaded tests - No g - Table 2

The large number of data points and the very large sample size indicate that we can have confidence in the outcomes of this meta-analysis. The estimated true correlation has a value of -.95 and 81% of the variance in the observed correlations is explained by artifactual errors. However, Hunter and Schmidt (1990) state that extreme outliers should be left out of the analyses, because they are most likely the result of errors in the data. They also argue that strong outliers artificially inflate the S.D. of effect sizes and thereby reduce the amount of variance that artifacts can explain. We chose to leave out three outliers – more than 4 S.D. below the average r and more than 8 S.D. below ρ – comprising 1% of the research participants.

This resulted in no changes in the value of the true correlation, a large decrease in the S.D. of ρ with 74%, and a large increase in the amount of variance explained in the observed correlations by artifacts by 22%. So, when the three outliers are excluded, artifacts explain virtually all of the variance in the observed correlations. Finally, a correction for deviation from perfect construct validity in g took place, using a conservative value of .90. This resulted in a value of -1.06 for the final estimated true correlation between g loadings and score gains. Applying several corrections in a meta-analysis may lead to correlations that are larger than 1.00 or -1.00, as is the case here. Percentages of variance accounted for by artifacts larger than 100% are also not uncommon in psychometric meta-analysis. They also do occur in other methods of statistical estimation (see Hunter & Schmidt, 1990, pp. 411-414 for a discussion).
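
The final step in that passage can be checked by hand. Assuming the correction for imperfect construct validity is the usual division of the estimated correlation by the validity value, the quoted figures reproduce as follows (the function name is mine, for illustration).

```python
def correct_for_construct_validity(rho, validity):
    """Disattenuate a correlation for imperfect construct validity
    (assumed here to be a simple division by the validity value)."""
    return rho / validity

rho_estimated = -0.95      # estimated true correlation quoted above
validity_of_g = 0.90       # conservative value quoted above

rho_final = correct_for_construct_validity(rho_estimated, validity_of_g)
print(round(rho_final, 2))
```

Because the divisor is below 1, the corrected value is necessarily larger in absolute size than the input, and it can exceed 1.00 in absolute value when the input is already close to the boundary, as the authors note.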

Earlier, some researchers investigated the issue of Spearman’s law of diminishing returns. It has been reported that the Flynn Effect originated from an increase in specific, not general, abilities. For instance, using the WAIS, WAIS-R and WAIS-III, Kane (2000, p. 565; & Oakland, 2000, p. 343) found that the subtest intercorrelations (or positive manifold) had decreased over time, which led him to conclude that “lower test intercorrelations may not reflect a diminishment in the importance of g per se, but rather the improvement of specific primary cognitive abilities”, as one explanation (see p. 565, for further discussion). Juan-Espinosa (2006), for his part, says the following :

The indifferentiation hypothesis has relevant practical implications. First of all, it can be assumed that Wechsler batteries are measuring the same g factor across all age groups. This being true, social correlates of the Wechsler’s scales as the prediction of the educational achievement or the likelihood of being out from the school (e.g., Neisser, Boodoo, Bouchard, Boykin, Brody, Ceci, Halpern, Loehlin, Perloff, Sternberg and Urbina, 1996) would be mostly due to the g factor (Jensen, 1998). However, the same cannot be said of the comparison across cohorts. The youngest cohorts depend more on non-g factors to achieve a higher performance level.

Similar findings were previously reported by Lynn & Cooper (1993, 1994). This outcome was of course inescapable insofar as the Flynn Effect results from increases in specific abilities rather than general abilities. By the same token, the specialization and differentiation of abilities appears wholly consistent with Woodley’s cognitive differentiation-integration effort (CD-IE) hypothesis (2011a, 2011b). Low-IQ people are more dependent on g than those with higher IQ, who can rely on a wider array of abilities. Here is, in short, how Woodley describes his model :

The tradeoff concerns two hypothetical types of effort – cognitive integration effort (CIE), associated with a strengthening of the manifold via the equal investment of bioenergetic resources (such as time, calories and cognitive real estate) into diverse abilities, and cognitive differentiation effort (CDE), associated with a weakening of the manifold via the unequal investment of resources into individual abilities.

For instance, while many researchers believe that nutrition and/or education, among other things, are behind the Flynn Effect, Woodley (2011a) tells us that any factor that reduces disease and mortality or improves health, nourishment, education, or the environment as a whole would permit the development of differentiated abilities and consequently the hollow gains of the Flynn Effect. Furthermore, given that increases in height (an indicator of good health and nutrition) are concentrated toward the upper end of the height distribution while FE gains are concentrated toward the lower end of the IQ distribution in Norway (Sundet et al., 2004, Figure 4), and that FE gains are not correlated with height gains in Sweden (Rönnlund et al., 2013, p. 23), there is reason to doubt that improvement in nutrition was behind the FE gains in developed countries. The FE gains seem to have stopped in developed countries (Williams, 2013, pp. 2-3) and have sometimes even reversed (Teasdale & Owen, 2008; Shayer & Ginsburg, 2007). Interestingly, this anti-Flynn Effect could be correlated with g-loadings (Woodley & Meisenberg, 2013). Four successful tests of the CD-IE effect (Woodley et al., 2013) show that it shares the same psychometric properties as the Flynn Effect, namely the absence of a Jensen Effect, making it once again a candidate explanation for the secular gains. Woodley and Madison (2013), using the Estonian data of Must et al. (2009), demonstrated that FE gains and changes in g-loadings (Δg) between measurement occasions displayed a robustly negative correlation. Again, this provides evidence that FE gains and general abilities are not related.

Although not relevant to g, there is still some controversy about the role of educational achievement in the secular gains. Lynn (2009, pp. 18-19) points out that, because the developmental gains (DQs) and IQ gains of preschool children and school-aged children were of equal magnitude, it is extremely difficult to assign a significant role to education or test-taking strategy. The social multipliers theory of Dickens and Flynn is also rejected because it would have predicted small gains among infants and preschool children, with gains progressively increasing up to adulthood due to cumulative effects, which did not occur. According to Lynn, the most plausible explanation is a common cause, namely, an improvement in nutrition. On the other hand, the IRT test of Pietschnig et al. (2013) suggests a certain role for education. The same authors have also demonstrated that the secular gains were due to an improvement at the lower end of the IQ distribution, vindicating Rodgers’s hypothesis and apparently the nutrition theory.

As for the height evidence discussed above, the nutrition theory does not require a perfect correlation between height and IQ changes, as Lynn maintained : “the nutrition theory of the secular increase of intelligence does not require perfect synchrony between increases in height and intelligence. There appear to be some micronutrients the lack of which does not adversely affect height but adversely affects the development of the brain and intelligence”. Furthermore, Lynn (2009) believes that the nutrition theory can easily explain why fluid IQ increased more than crystallized IQ because “Several studies have shown that sub-optimal nutrition impairs fluid intelligence more than crystallized intelligence (e.g. Lundgren et al., 2003), while nutritional supplements given to children raise their fluid IQs more than their crystallized IQs (Benton, 2001; Lynn & Harland, 1998; Schoenthaller, Bier, Young, Nichols, & Jansenns, 2000).” (p. 21). Jensen (1998, p. 320) however makes it clear that a (large) increase on highly g-loaded tests, such as the Raven, does not mean that the g-loadings of the test will also display a significant relationship with the secular gains (see for instance, Metzen 2012). The very fact that groups of children differing in mother’s education, family income and urbanization do not differ in FE gain rates in longitudinal data somewhat weakens the nutrition theory (Ang et al., 2010). Additionally, Flynn (2009) noted that, in Britain, Raven’s CPM gains were larger for high-SES children (in all age categories between 5 and 11) over the period 1938-2008. Nonetheless, Raven’s SPM gains show no clear-cut pattern, because gains were larger among high-SES children aged 5-9 but smaller than those of low-SES children at ages 9-15. This does not square with the nutrition theory.

Considering all the evidence regarding measurement bias in paper-pencil IQ tests, one would wonder whether the FE shows up in other kinds of cognitive tests, such as reaction time (RT) or inspection time (IT) tests. There is however only one study that has investigated this issue. This study (Nettelbeck & Wilson, 2004), although with small Ns, demonstrates the absence of FE gains in IT despite improvement on the PPVT (a highly culture-loaded test). The authors write “Despite the Flynn effect for vocabulary achievement, Table 1 demonstrates that there was no evidence of improvement in IT from 1981 (overall M= 123±87 ms) to 2001 (M = 116±71 ms).”. The hollowness of the gains underlying the Flynn Effect is further vindicated. Furthermore, Woodley et al. (2013) found evidence of slowing RT.

The fact that score differences between cohorts are not directly comparable and are hollow with respect to g has serious consequences for the Dickens and Flynn (2001) model, which has been extensively discussed (Loehlin, 2002; Rowe & Rodgers, 2002; Mingroni, 2007; Dickens & Flynn, 2002; Dickens, 2009). The theory implies that even minor differences in inherited abilities (e.g., talent, intelligence, …) could develop into major differences through social or environmental multipliers. The rationale goes like this : if a person is initially genetically advantaged in athletics, this person will have an inclination for sports. He will be motivated by the sort of tasks he performs well, which allows him to maximize his genetic potential, and thus his later performance, and this in turn gives him even more motivation. The better he gets, the more he enjoys the activity. This positive feedback is supposed to explain the increase in intra- and inter-group differences. Another important detail, as they note, is that “it is not only people’s phenotypic IQ that influences their environment, but also the IQs of others with whom they come into contact” (p. 347).

Their model, as they say (2001, pp. 347-349), simply suggests that people genetically advantaged in a particular trait will become matched with superior environments for that trait, and so it does not require the hypothesis of any factor X operating to depress one group but not the others. Studies indeed show that such a hypothesis must be rejected (Rowe et al., 1994, 1995; Rowe & Cleveland, 1996; Lubke et al., 2003). This is why Dickens (2005, p. 64) argues that “we might expect that persistent environmental differences between blacks and whites, as well as between generations, could cause a positive correlation between test score heritabilities and test differences” : their model implies that the greater the initial (physical) advantage, the greater the environmental influence on that trait. Because their model depends on the valid comparison of scores between cohorts and the existence of g-gains, the theory is fundamentally flawed. By the same token, we can also see where Flynn (2010a, p. 364) got it wrong :

Originally, Jensen argued: (1) the heritability of IQ within whites and probably within blacks was 0.80 and between-family factors accounted for only 0.12 of IQ variance — with only the latter relevant to group differences; (2) the square root of the percentage of variance explained gives the correlation between between-family environment and IQ, a correlation of about 0.33 (square root of 0.12=0.34); (3) if there is no genetic difference, blacks can be treated as a sample of the white population selected out by environmental inferiority; (4) enter regression to the mean — for blacks to be one SD below whites for IQ, they would have to be 3 SDs (3×.33=1) below the white mean for quality of environment; (5) no sane person can believe that — it means the average black cognitive environment is below the bottom 0.2% of white environments; (6) evading this dilemma entails positing a fantastic “factor X”, something that blights the environment of every black to the same degree (and thus does not reduce within-black heritability estimates), while being totally absent among whites (thus having no effect on within-white heritability estimates).

I used the Flynn Effect to break this steel chain of ideas: (1) the heritability of IQ both within the present and the last generations may well be 0.80 with factors relevant to group differences at 0.12; (2) the correlation between IQ and relevant environment is 0.33; (3) the present generation is analogous to a sample of the last selected out by a more enriched environment (a proposition I defend by denying a significant role to genetic enhancement); (4) enter regression to the mean — since the Dutch of 1982 scored 1.33 SDs higher than the Dutch of 1952 on Raven’s Progressive Matrices, the latter would have had to have a cognitive environment 4 SDs (4×0.33=1.33) below the average environment of the former; (5) either there was a factor X that separated the generations (which I too dismiss as fantastic) or something was wrong with Jensen’s case. When Dickens and Flynn developed their model, I knew what was wrong: it shows how heritability estimates can be as high as you please without robbing environment of its potency to create huge IQ gains over time.
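
The arithmetic in both chains of reasoning is the same : under the regression-to-the-mean logic described in the quote, the environmental gap (in SD units) needed to produce a given IQ gap is the IQ gap divided by the environment-IQ correlation. The sketch below only reproduces the quoted figures; the function name is mine, and the simple division is the assumption stated in the quote, not an independent model.

```python
import math

def required_env_gap(iq_gap_sd, r_env_iq):
    """Environmental gap (SD) needed for a given IQ gap (SD),
    assuming IQ gap = correlation x environmental gap."""
    return iq_gap_sd / r_env_iq

# Correlation implied by 12% of variance being between-family environment.
r_from_variance = math.sqrt(0.12)

r_env_iq = 0.33  # rounded value used in the quote

gap_race = required_env_gap(1.0, r_env_iq)     # Jensen's case: 1 SD IQ gap
gap_cohort = required_env_gap(1.33, r_env_iq)  # Flynn's Dutch case: 1.33 SD

print(f"sqrt(0.12) = {r_from_variance:.3f}")
print(f"1 SD IQ gap    -> {gap_race:.2f} SD environmental gap")
print(f"1.33 SD IQ gap -> {gap_cohort:.2f} SD environmental gap")
```

Both results round to the 3 SD and 4 SD figures given in the quote. The calculation is only meaningful if the IQ gap entered into it is a valid, comparable score difference, which is what the rest of this section is about.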

The logic here is not correct because Flynn compares biased IQ differences of cohort groups while Jensen compared unbiased IQ differences of racial groups. Although Flynn had nothing to say on measurement equivalence, he denied g as a valid argument against the phenomenon of secular gains :

You cannot dismiss the score gains of one group on another merely because the reduction of the score gap by subtest has a negative correlation with the g loadings of those subtests. In the case of each and every subtest, one group has gained on another on tasks with high cognitive complexity. Imagine we ranked the tasks of basketball from easy to difficult: making lay-ups, foul shots, jump shots from within the circle, jump shots outside the circle, and so on. If a team gains on another in terms of all of these skills, it has closed the shooting gap between them, despite the fact that it may close gaps less the more difficult the skill. Indeed, when a worse performing group begins to gain on a better, their gains on less complex tasks will tend to be greater than their gains on the more complex. That is why black gains on whites have had a (mild) tendency to be greater on subtests with lower g loadings.

Reverting to group differences at a given time, does the fact that the performance gap is larger on more complex then easier tasks tell us anything about genes versus environment? Imagine that one group has better genes for height and reflex arc but suffers from a less rich basketball environment (less incentive, worse coaching, less play). The environmental disadvantage will expand the between-group performance gap as complexity rises, just as much as a genetic deficit would. I have not played basketball since high school. I can still make 9 out of 10 lay-ups but have fallen far behind on the more difficult shots. The skill gap between basketball “unchallenged” players and those still active will be more pronounced the more difficult the task. In sum, someone exposed to an inferior environment hits what I call a “complexity ceiling”. Clearly, the existence of this ceiling does not differentiate whether the phenotypic gap is due to genes or environment.

While Flynn was perfectly right that a low-g person would improve more on less g-loaded (i.e., less complex) items than on more g-loaded ones, his analogy is defective. As Chuck (Feb.17.2011) pointed out :

We could use a basketball analogy to capture both positions on this matter. Flynn argues that g is analogous to general basketball ability; it’s important because it correlates with the ability to do complex moves, say like making reverse two-handed dunks. Flynn’s point is that to do a reverse two-handed dunk, one needs to learn all the basic moves. Since environmental disadvantages (poor coaches, limited practicing space, etc.) handicap one when it comes to basic moves, they necessarily handicap one more when it comes to complex basketball moves. Rushton and Jensen argue the g is analogous to a highly heritable athletic quotient; it’s important because it correlates with basic physiology, generalized sports ability, and basic eye-motor coordination. Their point is that it’s implausible that disadvantages in basketball training would lead to across the board disadvantages in all athletic endeavors and, moreover, lead to a larger handicap in general athleticism than to a handicap in basic basketball ability. Rather than disadvantages in basketball training leading to disadvantages in general athletic ability, it’s much more plausible that disadvantages in general athletic ability would lead to a reduced effectiveness of basketball training.

Flynn and other environmentalists can only circumnavigate g by insisting that a web of g affecting environmental circumstances, in effect, constructs g from the outside in. Given that g is psychometrically structurally similar across populations, sexes, ages, and cultures this seems implausible as it would necessitate that either everyone happened to encounter the same pattern of g formative environmental circumstances just at different levels of intensity or that environmental circumstances were themselves intercorrelated.

A lack of practice in one domain will surely affect that single domain more than it affects abilities in all domains of sport. The reverse is also true : practice in one domain will affect that domain more than all other domains of sport. This is how Murray (2005, fn. 71) puts it :

An athletic analogy may be usefully pursued for understanding these results. Suppose you have a friend who is a much better athlete than you, possessing better depth perception, hand-eye coordination, strength, and agility. Both of you try high-jumping for the first time, and your friend beats you. You practice for two weeks; your friend doesn’t. You have another contest and you beat your friend. But if tomorrow you were both to go out together and try tennis for the first time, your friend would beat you, just as your friend would beat you in high-jumping if he practiced as much as you did.

This is best illustrated by Jensen (1998), who said that a g-loaded effect should be generalizable, irrespective of g-loadings or task difficulty : “Scores based on vehicles that are superficially different though essentially similar to the specific skills trained in the treatment condition may show gains attributable to near transfer but fail to show any gain on vehicles that require far transfer, even though both the near and the far transfer tests are equally g-loaded in the untreated sample. Any true increase in the level of g connotes more than just narrow (or near) transfer of training; it necessarily implies far transfer.” (p. 334).

But worse is yet to come. The hypothesis of environmental multipliers surely requires that differences in behavior have little or no genetic component, that is, that behavior is highly malleable. Given the empirical literature (McGue & Lykken, 1992; McGue & Bouchard, 1998; Bouchard & McGue, 2003; Bouchard, 2004), this is not the case. Environment should not be treated as a purely environmental variable, given that even environments have some genetic component (Gottfredson, 2009, p. 50; Plomin, 2003, pp. 189-190; Plomin & Bergeman, 1991; Vinkhuyzen et al., 2009; Herrnstein & Murray, 1994, p. 314). Even culture could be under genetic influence (Plomin & Colledge, 2001, p. 231; Fuerle, 2008, pp. 66-67, 175, 257 fn. 2, 399-400 fn. 5). People may react differently to the same experiences depending on their genotypes (Rowe, 2001, pp. 68-72). Individuals indeed select, create and reshape their own environments on the basis of their genetic predispositions (Rowe, 2003, pp. 79-80). With age, the gene-environment correlation indeed shifts from a passive to an active form (Jensen, 1998, pp. 179, 181). Rowe (1997) rightly notes that it is not necessarily easy for parents to manipulate a child’s environment : “Parents do affect their children, but the direction of that “nudge” is often unpredictable. Encouraging one child to study hard may make that child get better grades, whereas a brother or sister may rebel against being “bossed” by the parents.” (p. 141). This is precisely why the proposal of Dickens and Flynn (2001, p. 363) is unrealistic, as they write :

… intervention programs are able to change them and take children’s “control” over them away, which means that the environment that affects a child’s IQ must be external to the child or at least subject to manipulation by outsiders.

Furthermore, if Dickens and Flynn are thinking about motivational factors, they should probably reconsider their point of view as well. As Jensen (1980, p. 322) noted long ago :

Nothing reinforces the behavioral manifestations of motivation as much as success itself. Abler students are rewarded by greater success, which in turn reinforces the kinds of behavior – attention, interest, persistence, and good study habits – that lead to further academic success. The repeated failures of less able students generally have just the opposite effect.

As the literature shows (Lai, 2011, pp. 15, 35-36), earlier performance encourages later motivation, rather than earlier motivation leading to later performance. Of course, if blacks have some initial physical advantages (Saletan, 2008; Fuerle, 2008, pp. 142, 179), they will be more inclined than any other group to practice sports instead of pursuing school, especially if their rate of failure at school was higher to begin with. Because of racial differences in traits unrelated to IQ, we should not expect members of different groups to create and shape similar environments in terms of cognitive stimulation.

Finally, while their model aims to explain the black-white IQ gap, it fails to note that the IQ gap is larger at higher SES levels, and it offers no explanation for this. Another crucial point is that IQ regression to the mean (Jensen, 1973, pp. 117-119; 1998, pp. 468-471), as observed in sibling and parent-child correlation analyses, does not square with the Dickens-Flynn model, because the model posits that low-IQ people will show IQ losses and high-IQ people IQ gains through negative and positive feedback loops. The evidence shows the exact opposite of what they would have predicted.
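To make the regression-to-the-mean expectation concrete, here is a minimal sketch. It assumes a linear relation between the scores of two relatives and equal variances in both, where μ is the mean of the population from which the pair is drawn, x is the score of one relative, Y is the score of the other, and r is their correlation. The value r = 0.5 and the scores 130 and 70 are illustrative assumptions, not figures taken from the sources cited above.

```latex
% Expected score of a relative, given the other relative's score x
% (assumes linearity and equal variances; r = 0.5 is illustrative only)
\[
  E[\,Y \mid X = x\,] = \mu + r\,(x - \mu)
\]
% Worked example with \mu = 100 and r = 0.5:
\[
  x = 130:\quad E[Y \mid X = 130] = 100 + 0.5\,(130 - 100) = 115
\]
\[
  x = 70:\quad E[Y \mid X = 70] = 100 + 0.5\,(70 - 100) = 85
\]
```

Under these assumptions, the relatives of high scorers and of low scorers are both expected to fall closer to μ than the person they are compared with, by a fraction (1 − r) of the original deviation.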

Whether we consider the Flynn Effect or the Dickens-Flynn feedback loop model, both are irrelevant to understanding the nature of the black-white difference.

References :

  1. Ang SiewChing, Rodgers Joseph Lee, & Wänström Linda, 2010, The Flynn Effect within subgroups in the U.S.: Gender, race, income, education, and urbanization differences in the NLSY-Children data.
  2. Ashton Michael C., and Lee Kibeom, 2005, Problems with the method of correlated vectors.
  3. Bandeira Denise R., Costa Angelo, & Arteche Adriane, 2012, The Flynn effect in Brazil: Examining generational changes in the Draw-a-Person and in the Raven’s Coloured Progressive Matrices.
  4. Beaujean A. Alexander, and Osterlind Steven J., 2008, Using Item Response Theory to assess the Flynn Effect in the National Longitudinal Study of Youth 79 Children and Young Adults data.
  5. Beaujean A. Alexander, and Sheng Yanyan, 2010, Examining the Flynn Effect in the General Social Survey Vocabulary test using item response theory.
  6. Carpenter Patricia A., Just Marcel Adam, and Shell Peter, 1990, What One Intelligence Test Measures: A Theoretical Account of the Processing in the Raven Progressive Matrices Test.
  7. Chuck, Race, genes, and disparity, February 17, 2011, Spearman’s hypothesis and the Jensen Effect.
  8. Dickens William T., 2005, Genetic Differences and School Readiness.
  9. Dickens William T., 2009, A Response to Recent Critics of Dickens and Flynn (2001).
  10. Dickens William T., and Flynn James R., 2001, Heritability estimates versus large environmental effects: The IQ paradox resolved.
  11. Dickens William T., and Flynn James R., 2002, The IQ Paradox: Still Resolved.
  12. Dolan Conor V., 2000, Investigating Spearman’s hypothesis by means of multi-group confirmatory factor analysis.
  13. Dolan Conor V., and Hamaker Ellen L., 2001, Investigating black–white differences in psychometric IQ: Multi-group confirmatory factor analysis of the WISC-R and K-ABC and a critique of the method of correlated vectors.
  14. Flynn James R., 2000, IQ gains, WISC subtests and fluid g: g theory and the relevance of Spearman’s hypothesis to race, in The Nature of Intelligence (Wiley).
  15. Flynn James R., 2009, Requiem for nutrition as the cause of IQ gains: Raven’s gains in Britain 1938–2008.
  16. Flynn James R., 2010a, The spectacles through which I see the race and IQ debate.
  17. Flynn James R., 2010b, Problems With IQ Gains: The Huge Vocabulary Gap.
  18. Fox M. C., Mitchum A. L., 2012, A Knowledge-Based Theory of Rising Scores on “Culture-Free” Tests.
  19. Fuerle Richard D., 2008, Erectus Walks Amongst Us: The evolution of modern humans.
  20. Gottfredson Linda S., 2007, Shattering Logic to Explain the Flynn Effect.
  21. Gottfredson Linda S., 2008, Of What Value Is Intelligence?
  22. Gottfredson Linda S., 2009, Logical fallacies used to dismiss the evidence on intelligence testing.
  23. Herrnstein Richard J., and Murray Charles, 1994, The Bell Curve: Intelligence and Class Structure in American Life, With a New Afterword by Charles Murray.
  24. Jensen Arthur R., 1973, Educability and Group Differences.
  25. Jensen Arthur R., 1980, Bias in Mental Testing.
  26. Jensen Arthur R., 1998, The g Factor: The Science of Mental Ability.
  27. Juan-Espinosa Manuel, Cuevas Lara, Escorial Sergio, & García Luis F., 2006, The differentiation hypothesis and the Flynn effect.
  28. Kanaya Tomoe, and Ceci Stephen J., 2010, The Flynn Effect in the WISC Subtests Among School Children Tested for Special Education Services.
  29. Kane Harrison D., 2000, A secular decline in Spearman’s g: evidence from the WAIS, WAIS-R and WAIS-III.
  30. Kane Harrison D., & Oakland Thomas D., 2000, Secular Declines in Spearman’s g: Some Evidence From the United States.
  31. Kaufman Alan S., 2010a, “In What Way Are Apples and Oranges Alike?” A Critique of Flynn’s Interpretation of the Flynn Effect.
  32. Kaufman Alan S., 2010b, Looking Through Flynn’s Rose-Colored Scientific Spectacles.
  33. Lai Emily R., 2011, Motivation: A Literature Review.
  34. Loehlin John C., 2002, The IQ Paradox: Resolved? Still an Open Question.
  35. Lubke Gitta H., Dolan Conor V., Kelderman Henk, and Mellenbergh Gideon J., 2003, On the relationship between sources of within- and between-group differences and measurement invariance in the common factor model.
  36. Lynn Richard, 2009, What has caused the Flynn effect? Secular increases in the Development Quotients of infants.
  37. Lynn Richard, & Cooper Colin, 1993, A secular decline in Spearman’s g in France.
  38. Lynn Richard, & Cooper Colin, 1994, A Secular Decline in the Strength of Spearman’s g in Japan.
  39. Mingroni Michael A., 2007, Resolving the IQ Paradox: Heterosis as a Cause of the Flynn Effect and Other Trends.
  40. Murray Charles, 2005, The Inequality Taboo.
  41. Must Olev, Must Aasa, and Raudik Vilve, 2003, The secular rise in IQs: In Estonia, the Flynn effect is not a Jensen effect.
  42. Must Olev, & Must Aasa, 2013, Changes in test-taking patterns over time.
  43. Must Olev, te Nijenhuis Jan, Must Aasa, and van Vianen Annelies E.M., 2009, Comparability of IQ scores over time.
  44. Nettelbeck Ted, & Wilson Carlene, 2004, The Flynn effect: Smarter not faster.
  45. Pietschnig Jakob, Tran Ulrich S., Voracek Martin, 2013, Item-response theory modeling of IQ gains (the Flynn effect) on crystallized intelligence: Rodgers’ hypothesis yes, Brand’s hypothesis perhaps.
  46. Plomin Robert, 2003, General Cognitive Ability, in Behavioral Genetics in the Postgenomic Era.
  47. Plomin Robert, and Bergeman C. S., 1991, The nature of nurture: Genetic influence on “environmental” measures.
  48. Plomin Robert, and Colledge Essi, 2001, Genetics and Psychology: Beyond Heritability.
  49. Rodgers Joseph L., 1999, A Critique of the Flynn Effect: Massive IQ Gains, Methodological Artifacts, or Both?
  50. Rönnlund Michael, Carlstedt Berit, Blomstedt Yulia, Nilsson Lars-Göran, and Weinehall Lars, 2013, Secular trends in cognitive test performance: Swedish conscript data 1970–1993.
  51. Rowe David C., 1997, A Place at the Policy Table? Behavior Genetics and Estimates of Family Environmental Effects on IQ.
  52. Rowe David C., 2001, Do People Make Environments or Do Environments Make People?.
  53. Rowe David C., 2003, Assessing Genotype-Environment Interactions and Correlations in the Postgenomic Era, in Behavioral Genetics in the Postgenomic Era.
  54. Rowe David C., & Rodgers Joseph L., 2002, Expanding Variance and the Case of Historical Changes in IQ Means: A Critique of Dickens and Flynn (2001).
  55. Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1994, No More Than Skin Deep: Ethnic and Racial Similarity in Developmental Process.
  56. Rowe David C., Vazsonyi Alexander T., and Flannery Daniel J., 1995, Ethnic and Racial Similarity in Developmental Process: A Study of Academic Achievement.
  57. Rowe David C., and Cleveland Hobard H., 1996, Academic Achievement in Blacks and Whites: Are the Developmental Processes Similar?
  58. Rushton J. Philippe, 1999, Secular gains in IQ not related to the g factor and inbreeding depression – unlike Black-White differences: A reply to Flynn.
  59. Rushton J. Philippe, and Jensen Arthur R., 2010, The rise and fall of the Flynn Effect as a reason to expect a narrowing of the Black–White IQ gap.
  60. Saletan William, 2008, Race, genes, and sports.
  61. Shayer Michael, Ginsburg Denise, & Coe Robert, 2007, Thirty years on – a large anti-Flynn effect? The Piagetian test Volume & Heaviness norms 1975-2003.
  62. Shiu William, Beaujean A. Alexander, Must Olev, te Nijenhuis Jan, Must Aasa, 2013, An item-level examination of the Flynn effect on the National Intelligence Test in Estonia.
  63. Sundet Jon Martin, Barlaug Dag G., Torjussen Tore M., 2004, The end of the Flynn effect? A study of secular trends in mean intelligence test scores of Norwegian conscripts during half a century.
  64. te Nijenhuis Jan, 2012, The Flynn effect, group differences, and g loadings.
  65. te Nijenhuis Jan, van Vianen Annelies E.M., & van der Flier Henk, 2007, Score gains on g-loaded tests: No g.
  66. te Nijenhuis Jan, & van der Flier Henk, 2013, Is the Flynn effect on g? A meta-analysis.
  67. Vinkhuyzen Anna A. E., van der Sluis Sophie, de Geus E. J. C., Boomsma Dorret I., and Posthuma Danielle, 2009, Genetic influences on ‘environmental’ factors.
  68. Wai Jonathan, and Putallaz Martha, 2011, The Flynn effect puzzle: A 30-year examination from the right tail of the ability distribution provides some missing pieces.
  69. Wicherts Jelte M., & Dolan Conor V., 2010, Measurement Invariance in Confirmatory Factor Analysis: An Illustration Using IQ Test Performance of Minorities.
  70. Wicherts Jelte M., Dolan Conor V., Hessen David J., Oosterveld Paul, van Baal G. Caroline M., Boomsma Dorret I., Span Mark M., 2004, Are intelligence tests measurement invariant over time? Investigating the nature of the Flynn effect.
  71. Williams L. Robert, 2013, Overview of the Flynn effect.
  72. Woodley Michael A., 2011a, A life history model of the Lynn–Flynn effect.
  73. Woodley Michael A., 2011b, The Cognitive Differentiation-Integration Effort Hypothesis: A Synthesis Between the Fitness Indicator and Life History Models of Human Intelligence.
  74. Woodley Michael A., 2011c, Heterosis Doesn’t Cause the Flynn Effect: A Critical Examination of Mingroni (2007).
  75. Woodley Michael A., Figueredo Aurelio José, Brown Sacha D., & Ross Kari C., 2013, Four successful tests of the Cognitive Differentiation-Integration Effort hypothesis.
  76. Woodley Michael A., & Madison Guy, 2013, Establishing an association between the Flynn effect and ability differentiation.
  77. Woodley Michael A., & Meisenberg Gerhard, 2013, In the Netherlands the anti-Flynn effect is a Jensen effect.
