In a widely cited paper, Jaeggi et al. (2008) reported that working memory training can substantially boost IQ. Jaeggi and her colleagues used the “dual n-back” task, which consists of two independent sequences presented simultaneously: one auditory, the other visual. See Figure 1 from Jaeggi et al.:
As they put it, practicing a dual n-back task could raise fluid intelligence, that is, the ability to think logically and solve novel problems. Recall that working memory training would be of little value if such programs affected only tasks similar to those that were trained, because g is about solving new, unfamiliar problems.
The finding that the transfer to Gf remained even after taking the specific training effect into account seems to be counterintuitive, especially because the specific training effect is also related to training time. The reason for this capacity might be that participants with a very high level of n at the end of the training period may have developed very task specific strategies, which obviously boosts n-back performance, but may prevent transfer because these strategies remain too task-specific (5, 20). The averaged n-back level in the last session is therefore not critical to predicting a gain in Gf; rather, it seems that working at the capacity limit promotes transfer to Gf.
There are some weaknesses in their study, though. First, as Sternberg (2008) noted, Jaeggi’s result should not embolden egalitarians (see also Conway & Getz, 2010). The use of a single training task and the absence of an alternative task for the control group are serious limitations, quite apart from the questions of whether the fluid-ability tests retain their predictive power after training and whether the gains are durable:
First, with regard to the main independent variable, there was only one training task in the study, so it is unclear to what extent the results can be generalized to other working-memory tasks. It would be important to show that the results are really about working memory rather than some peculiarity of the particular training task.
Second, with regard to the main dependent variable, there was only one kind of fluid-ability test, geometric matrix problems from various tests of the kind found in the Raven Progressive Matrices (15) and similar tests. It would be important to show that the results generalize to other fluid-ability tasks rather than being peculiar to this kind of task. Matrix problems are generally considered to be an excellent measure of fluid intelligence (16), but they do place a particularly great demand on working memory. At the same time, fluid-ability tests tend to be highly correlated with each other (17), so generalization would appear likely. Whether generalization extends beyond the matrix tests to other kinds of cognitive tests, such as of spatial, numerical, or other abilities, remains to be seen. …
Sixth, the control group in Jaeggi et al.’s study (10) had no alternative task, which can lead readers to query whether a placebo treatment in an additional control group might have led to a stronger comparison. In future work, one would want to include a training alternative that teaches something expected not to be relevant to performance on the fluid-ability tests.
But the detail that most severely undermines Jaeggi et al. (2008) is the speeding of the IQ test (10 minutes instead of 45) for the groups administered the BOMAT. Soon after the study was published, Moody (2009) highlighted its flaws, if not fraud. Jaeggi et al. deliberately reduced the time allowed for the test, so that subjects had no opportunity to attempt the more complex items. Thus the test failed to challenge their fluid intelligence. Worse, the dual n-back task, while different in format from the RAPM and the BOMAT (two similar IQ tests), does not appear to be very different in nature from the transfer task, which is an IQ test. Indeed, one of the two working-memory tasks “involved recall of the location of a small square in one of several positions in a visual matrix pattern”, which closely resembles what the BOMAT requires. This makes it hard to accept Jaeggi’s claim that “the trained task is entirely different from the intelligence test itself”. It is therefore not surprising that Jaeggi et al. (2008) found improved scores on the BOMAT but not on the RAPM. In any case, Moody’s (2009) criticism is worth quoting at length:
The subjects were divided into four groups, differing in the number of days of training they received on the task of working memory. The group that received the least training (8 days) was tested on Raven’s Advanced Progressive Matrices (Raven, 1990), a widely used and well-established test of fluid intelligence. This group, however, demonstrated negligible improvement between pre- and post-test performance.
The other three groups were not tested using Raven’s Matrices, but rather on an alternative test of much more recent origin. The Bochumer Matrices Test (BOMAT) (Hossiep, Turck, & Hasella, 1999) is similar to Raven’s in that it consists of visual analogies. In both tests, a series of geometric and other figures is presented in a matrix format and the subject is required to infer a pattern in order to predict the next figure in the series. The authors provide no reason for switching from Raven’s to the BOMAT.
The BOMAT differs from Raven’s in some important respects, but is similar in one crucial attribute: both tests are progressive in nature, which means that test items are sequentially arranged in order of increasing difficulty. A high score on the test, therefore, is predicated on subjects’ ability to solve the more difficult items.
However, this progressive feature of the test was effectively eliminated by the manner in which Jaeggi et al. administered it. The BOMAT is a 29-item test which subjects are supposed to be allowed 45 min to complete. Remarkably, however, Jaeggi et al. reduced the allotted time from 45 min to 10. The effect of this restriction was to make it impossible for subjects to proceed to the more difficult items on the test. The large majority of the subjects — regardless of the number of days of training they received — answered less than 14 test items correctly.
By virtue of the manner in which they administered the BOMAT, Jaeggi et al. transformed it from a test of fluid intelligence into a speed test of ability to solve the easier visual analogies.
The time restriction not only made it impossible for subjects to proceed to the more difficult items, it also limited the opportunity to learn about the test — and so improve performance — in the process of taking it. This factor cannot be neglected because test performance does improve with practice, as demonstrated by the control groups in the Jaeggi study, whose improvement from pre- to post-test was about half that of the experimental groups. The same learning process that occurs from one administration of the test to the next may also operate within a given administration of the test — provided subjects are allowed sufficient time to complete it.
Since the whole weight of their conclusion rests upon the validity of their measure of fluid intelligence, one might assume the authors would present a careful defense of the manner in which they administered the BOMAT. Instead they do not even mention that subjects are normally allowed 45 min to complete the test. Nor do they mention that the test has 29 items, of which most of their subjects completed less than half.
The authors’ entire rationale for reducing the allotted time to 10 min is confined to a footnote. That footnote reads as follows:
Although this procedure differs from the standardized procedure, there is evidence that this timed procedure has little influence on relative standing in these tests, in that the correlation of speeded and non-speeded versions is very high (r = 0.95; ref. 37).
The reference given in the footnote is to a 1988 study (Frearson & Eysenck, 1986) that is not in fact designed to support the conclusion stated by Jaeggi et al. The 1988 study merely contains a footnote of its own, which refers in turn to unpublished research conducted forty years earlier. That research involved Raven’s matrices, not the BOMAT, and entailed a reduction in time of at most 50%, not more than 75%, as in the Jaeggi study.
So instead of offering a reasoned defense of their procedure, Jaeggi et al. provide merely a footnote which refers in turn to a footnote in another study. The second footnote describes unpublished results, evidently recalled by memory over a span of 40 years, involving a different test and a much less severe reduction in time.
In this context it bears repeating that the group that was tested on Raven’s matrices (with presumably the same time restriction) showed virtually no improvement in test performance, in spite of eight days’ training on working memory. Performance gains only appeared for the groups administered the BOMAT. But the BOMAT differs in one important respect from Raven’s. Raven’s matrices are presented in a 3 × 3 format, whereas the BOMAT consists of a 5 × 3 matrix configuration.
With 15 visual figures to keep track of in each test item instead of 9, the BOMAT puts added emphasis on subjects’ ability to hold details of the figures in working memory, especially under the condition of a severe time constraint. Therefore it is not surprising that extensive training on a task of working memory would facilitate performance on the early and easiest BOMAT test items — those that present less of a challenge to fluid intelligence.
This interpretation acquires added plausibility from the nature of one of the two working-memory tasks administered to the experimental groups. The authors maintain that those tasks were “entirely different” from the test of fluid intelligence. One of the tasks merits that description: it was a sequence of letters presented auditorily through headphones.
But the other working-memory task involved recall of the location of a small square in one of several positions in a visual matrix pattern. It represents in simplified form precisely the kind of detail required to solve visual analogies. Rather than being “entirely different” from the test items on the BOMAT, this task seems well-designed to facilitate performance on that test.
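To put Moody’s numbers in perspective, here is a minimal arithmetic sketch in Python. It uses only the figures quoted above (45 minutes, 10 minutes, 29 items); the function names are illustrative, and the per-item pace assumes, purely for the sake of the calculation, that a subject tries to attempt every item.

```python
# Time budgets and item count taken from the Moody (2009) passage quoted above.
STANDARD_MINUTES = 45
SPEEDED_MINUTES = 10
BOMAT_ITEMS = 29


def time_reduction(standard: float, speeded: float) -> float:
    """Fraction of the standard testing time that was removed."""
    return (standard - speeded) / standard


def seconds_per_item(minutes: float, items: int) -> float:
    """Average seconds available per item if every item were attempted (an assumption)."""
    return minutes * 60 / items


reduction = time_reduction(STANDARD_MINUTES, SPEEDED_MINUTES)
standard_pace = seconds_per_item(STANDARD_MINUTES, BOMAT_ITEMS)
speeded_pace = seconds_per_item(SPEEDED_MINUTES, BOMAT_ITEMS)

print(f"time removed: {reduction:.1%}")
print(f"standard pace: {standard_pace:.0f} s per item")
print(f"speeded pace: {speeded_pace:.0f} s per item")
```

The reduction comes to roughly 78%, consistent with Moody’s “more than 75%”, and the speeded pace works out to about 21 seconds per item against about 93 seconds under the standard procedure.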
Jaeggi et al. (2010, p. 9) then countered Moody by arguing that speeded and non-speeded administrations should yield similar results:
Moody (2009) has argued that restricting the tests to just the early items leaves out the items that have higher Gf loadings. This issue has been addressed before by other researchers who investigated whether there are differential age effects or working memory involvement in the different parts of the APM (Salthouse, 1993; Unsworth & Engle, 2005). These studies found no evidence for differential processes in the various items of the APM, at least for the first three quartiles of the task; thus, it seems unlikely that a subset of items in the APM measures something different than Gf. In our own data, the transfer effects were actually more pronounced for the second half of the test in the APM, which is reflected in a significant 3-way interaction … . In the BOMAT, we observed no differential transfer effects for the earlier vs later items … . Thus, if there are any differences in Gf loading in the various parts of the matrices tasks, the present data suggest that the transfer effects are roughly equivalent for the parts of the test that are claimed to have higher vs lower Gf loadings.
Chooi (2011, pp. 11-12), after discussing Jaeggi et al. (2008), disagrees with Jaeggi’s (2010) defense. Indeed, Chooi (2011, p. 62) failed to replicate Jaeggi et al. (2008, 2010). Subsequently, Chooi & Thompson (2012, pp. 537-538) again tried to replicate Jaeggi. Once more they report a null result, with further evidence that, contrary to Jaeggi’s claim (2010, p. 9), whether the test is speeded does matter:
The numbers suggested that participants from the current study and the original Jaeggi et al. (2008) study showed very similar performance on the training task; however, upon closer inspection of the data reported by Jaeggi et al., 2008, the authors collapsed post-test scores for all training groups and concluded a significant improvement in performance on intelligence tests after training based on an increase in about 2.5 points. This is misleading and inappropriate since not all participants took the same test for the purposes of detecting transfer effects. … Although our data on training improvement cannot be directly compared with those obtained from the two studies by Jaeggi et al. (2008, 2010), we argue that participants who trained their working memory in the current study improved on the training task just as much and that participants in the current study were just as motivated and committed as participants in the original study conducted by Jaeggi et al. (2008).
Participants in the current study and those in the studies conducted by Jaeggi et al. (2008, 2010) took the test under different administration procedures. RAPM was administered with no time constraint in the current study as recommended by the test provider, so participants were allowed to solve as many items as they could under no time pressure. Jaeggi and her colleagues administered their transfer tasks, the RAPM and BOMAT, with a time constraint — participants in their studies only had 10 min to solve as many items as they could (Jaeggi et al., 2008). In their first study, those in the 19-day training group answered about 4 more items on the BOMAT correctly at post-test (Jaeggi et al., 2008) and in their second study, the 20-day training group correctly answered 3 additional items in 16 min at post-test (Jaeggi et al., 2010). In their replication study, participants answered 2 additional items on the RAPM in 11 min after training for 20 days (Jaeggi et al., 2010). There was inconsistent usage of transfer tasks in the original study, where the researchers used the RAPM in the 8-day condition and not in the other training conditions. Participants who trained for 8-days showed no significant improvement on the RAPM at post-test (Jaeggi et al., 2008).
… Participants were told that there were no time constraints and they could take as much time as they wanted to complete the items on both tests, so there were participants who took more than 20 min to complete both tests. Similarly, participants were given 15–20 min at the beginning of post-test session to work on the Mill-Hill and RAPM before the timed tests and OSPAN were administered. In essence, participants in the current study had as much time as those in the studies carried out by Jaeggi et al. (2008, 2010) with the added advantage of no time pressure exerted on the participants. Though Jaeggi et al. argued that the timed administration of RAPM/BOMAT in their studies was not an issue, the untimed administration of the same test in our study showed no significant improvements in RAPM scores.
The current study was designed to replicate and extend the original study by Jaeggi et al. (2008); thus, it was designed not only to detect an increase in scores but also to determine how the increase in performance arose should there be any, whether through improvements in verbal, perceptual or spatial rotation abilities following Johnson and Bouchard’s (2005a) VPR model of intelligence. If general intelligence was improved after working memory training, it is imperative to know what underlying ability(ies) specifically led to an increase in general intelligence. The series of short, timed mental abilities tests administered in the current study were to provide additional information should there be an increase in the transfer task, RAPM. These tests were selected based on Johnson and Bouchard’s (2005a) proposed model of intelligence, and exploratory factor analysis conducted on the test variables at pre-test (N=117) in the current study supported the model (Table 5). However, results from the current study suggested no improvement overall in each of the three abilities.
Thompson et al. (2013, pp. 9-11) subsequently showed that while improvement on the trained tasks (after 20 sessions) persisted for at least 6 months, there was no transfer to the non-trained tasks. Moreover, they report that personality assessments and motivation were not clearly associated with positive transfer outcomes. Whereas Jaeggi et al. (2008) found a significant effect, Thompson’s study provided 33% more training per session yet found no greater gains.
Earlier, Jaeggi et al. (2011) had struck back with another study supporting their previous findings:
Table 1 displays the results (SPM stands for Raven’s Standard Progressive Matrices and TONI for the Test of Nonverbal Intelligence). Strangely enough, the control group “who trained on a knowledge-based task that did not engage working memory” ends up with a higher SPM score at follow-up than the large-training-gain group, whose score regressed between post-test and follow-up. The control group also outperformed both the large- and small-training-gain groups on the TONI at follow-up.
Nevertheless, even if we accept Jaeggi’s conclusion, it remains to be seen whether studies in this field generally yield positive results. Overall, the findings seem mixed, if not pessimistic (Shipstead et al., 2010, pp. 255-270; Seidler et al., 2010; Fox & Charness, 2010; Zinke et al., 2012; Brehmer et al., 2012).
A meta-analysis should be helpful in this matter. In fact, this is exactly what Monica Melby-Lervåg and Charles Hulme (2012) have done. A detail worth noting is that they excluded some studies that did not meet adequate methodological criteria (pp. 4-5), and that they examined the impact of “no-contact” control groups, which inflate the effect size through expectancy or placebo effects, the so-called Hawthorne effect (Shipstead, 2012b, p. 191; see also Shipstead et al., 2010, pp. 251-252, for a thorough discussion), and which seem to be common in this research. Among other problems, regression to the mean, i.e., the tendency of low-scoring subjects to perform better at post-test and of high-scoring subjects to perform worse, also has to be addressed by randomly assigning subjects to the study groups. That said, for the studies included in their meta-analysis, they found an overall effect size ranging from 0.16 to 0.23, which means that the difference between the trained and control groups is relatively small. A moderator analysis (Table 2), however, reveals d values of 0.04 for studies using randomization and 0.00 for those using an active (i.e., treated) control group, for immediate effects on nonverbal abilities. Here are the findings:
Immediate Effects of Working Memory Training on Far-Transfer Measures
Nonverbal ability. Figure 6 shows the 22 effect sizes comparing the pretest–posttest gains between working memory training groups and control groups on nonverbal ability (N training groups = 628, mean sample size = 28.54, N controls = 528, mean sample size = 24.0). The mean effect size was small (d = 0.19), 95% CI [0.03, 0.37], p = .02. The heterogeneity between studies was significant, Q(21) = 39.17, p < .01, I² = 46.38%. The funnel plot indicated a publication bias to the right of the mean (i.e., studies with a higher effect size than the mean appeared to be missing), and in a trim and fill analysis, the adjusted effect size after imputation of five studies was d = 0.34, 95% CI [0.17, 0.52]. A sensitivity analysis showed that after removing outliers, the overall effect size ranged from d = 0.16, 95% CI [0.00, 0.32], to d = 0.23, 95% CI [0.06, 0.39].
Moderators of immediate transfer effects of working memory training to measures of nonverbal ability are shown in Table 2. There was a significant difference in outcome between studies with treated controls and studies with only untreated controls. In fact, the studies with treated control groups had a mean effect size close to zero (notably, the 95% confidence intervals for treated controls were d = -.24 to 0.22, and for untreated controls d = 0.23 to 0.56). More specifically, several of the research groups demonstrated significant transfer effects to nonverbal ability when they used untreated control groups but did not replicate such effects when a treated control group was used (e.g., Jaeggi, Buschkuehl, Jonides, & Shah, 2011; Nutley, Söderqvist, Bryde, Thorell, Humphreys, & Klingberg, 2011). Similarly, the difference in outcome between randomized and nonrandomized studies was close to significance (p = .06), with the randomized studies giving a mean effect size that was close to zero. Notably, all the studies with untreated control groups are also nonrandomized; it is apparent from these analyses that the use of randomized designs with an alternative treatment control group are essential to give unambiguous evidence for training effects in this field.
Long-Term Effects of Working Memory Training on Transfer Measures
Table 4 shows the total number of participants in training and control groups, the total number of effect sizes, the time between the posttest and the follow-up, and the mean difference in gain between training and control groups from the pretest to the follow-up. It is apparent that all these long-term effects were small and nonsignificant. The true heterogeneity between studies was zero for all variables, indicating that the results were consistent across the studies included here. The funnel plot with trim and fill analyses did not indicate any publication bias. As for the attrition rate, on average, the studies lost 10% of the participants in the training group and 11% of the participants in the control group between the posttest and the follow-up. Only one study with two independent comparisons reported long-term effects for verbal ability (E. Dahlin et al., 2008). For the younger sample in this study, with 11 trained and seven control participants, long-term effects for verbal ability was nonsignificant (d = 0.46) 95% CI [-0.45, 1.37]. For the older participants in this study (13 trained, seven controls), the long term effects were negative and nonsignificant (d = -0.08), 95% CI [-0.96, 0.80].
In summary, there is no evidence from the studies reviewed here that working memory training produces reliable immediate or delayed improvements on measures of verbal ability, word reading, or arithmetic. For nonverbal reasoning, the mean effect across 22 studies was small but reliable immediately after training. However, these effects did not persist at the follow-up test, and in the best designed studies, using a random allocation of participants and treated controls, even the immediate effects of training were essentially zero. For attention (Stroop task), there was a small to moderate effect immediately after training, but the effect was reduced to zero at follow-up.
Methodological Issues in the Studies of Working Memory Training
Several studies were excluded because they lack a control group, since as outlined in the introduction, such studies cannot provide any convincing support for the effects of an intervention (e.g., Holmes et al., 2010; Mezzacappa & Buckner, 2010). However, among the studies that were included in our review, many used only untreated control groups. As demonstrated in our moderator analyses, such studies typically overestimated effects due to training, and research groups who demonstrated transfer effects when using an untreated control group typically failed to replicate such effects when using treated controls (Jaeggi, Buschkuehl, Jonides, & Shah, 2011; Nutley, Söderqvist, Bryde, Thorell, Humphreys, & Klingberg, 2011). Also, because the studies reviewed frequently use multiple significance tests on the same sample without correcting for this, it is likely that some group differences arose by chance (for example, if one conducts 20 significance tests on the same data set, the Type 1 error rate is 64% (Shadish, Cook, & Campbell, 2002)). Especially if only a subset of the data is reported, this can be very misleading.
Finally, one methodological issue that is particularly worrying is that some studies show far-transfer effects (e.g., to Raven’s matrices) in the absence of near-transfer effects to measures of working memory (e.g., Jaeggi, Buschkuehl, Jonides, & Perrig, 2008; Jaeggi et al., 2010). We would argue that such a pattern of results is essentially uninterpretable, since any far-transfer effects of working memory training theoretically must be caused by changes in working memory capacity. The absence of working memory training effects, coupled with reliable effects on far-transfer measures, raises concerns about whether such effects are artifacts of measures with poor reliability and/or Type 1 errors. Several of the studies are also potentially vulnerable to artifacts arising from regression to the mean, since they select groups on the basis of extreme scores but do not use random assignment (e.g., Holmes, Gathercole, & Dunning, 2009; Horowitz-Kraus & Breznitz, 2009; Klingberg, Forssberg, & Westerberg, 2002).
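Two of the statistics in the passage above can be checked with simple arithmetic. The sketch below is a sanity check, not the authors’ computation: it assumes the usual definition of I² as (Q − df)/Q, and it assumes that the 20 significance tests are independent and each run at α = .05, which the quoted text does not state explicitly.

```python
def i_squared(q: float, df: int) -> float:
    """I^2 heterogeneity statistic as a percentage, assuming I^2 = (Q - df) / Q."""
    return max(0.0, (q - df) / q) * 100


def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests (assumption)."""
    return 1 - (1 - alpha) ** n_tests


# Q(21) = 39.17 is the heterogeneity test reported for nonverbal ability.
i2 = i_squared(39.17, 21)

# 20 tests at alpha = .05, as in the quoted example.
fwer = familywise_error(20)

print(f"I^2 = {i2:.2f}%")
print(f"familywise Type 1 error = {fwer:.1%}")
```

Under these assumptions I² comes out at about 46.4%, in line with the reported 46.38% (the small gap is plausibly rounding in Q), and the familywise error rate is about 64%, matching the figure attributed to Shadish, Cook, & Campbell (2002).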
Apart from Melby-Lervåg and Hulme’s meta-analysis, other experiments not included in it also produced negative results, failing to replicate Jaeggi and others; see, for example, Shipstead et al. (2012a) and Redick et al. (2012).
In “Is Working Memory Training Effective?”, Shipstead et al. (p. 6) argue that most studies measure cognitive abilities via single tasks, with the inherent problem that scores on any single task are driven by the ability of interest as well as by other systematic and random influences. To get a clearer picture of transfer, a battery of varied tests must be included. In Figure 3, below, the numbers at the far left of the picture are the correlations, or factor loadings; squaring them gives the portion of variance in each task that is explained by WMC, which ranges from 46% to 70%. As they note, “While some tasks provide stronger reflections of WM capacity than others, no single task reflects WM capacity alone.” (p. 7).
Likewise, far transfer tasks are not perfect measures of ability. In many training studies, Raven’s Progressive Matrices (Ravens; Raven, 1990, 1995, 1998) serves as the sole indicator of Gf. This “matrix reasoning” task presents test takers with a series of abstract pictures that are arranged in a grid. One piece of the grid is missing, and the test taker must choose an option (from among several) that completes the sequence. Jensen (1998) estimates that 64% of the variance in Ravens performance can be explained by Gf. Similarly, Figure 3 indicates that in the study of Kane et al. (2004), 58% of the Ravens variance was explained by Gf.
It is clear that Ravens is strongly related to Gf. However, 30%–40% of the variance in Ravens is attributable to other influences. Thus, when Ravens (or any other task) serves as the sole indicator of far transfer, performance improvements can be explained without assuming that a general ability has changed. Instead, it can be parsimoniously concluded that training has influenced something that is specific to performing Ravens, but not necessarily applicable to other reasoning contexts (Carroll, 1993; Jensen, 1998; Moody, 2009; Schmiedek et al., 2010; te Nijenhuis, van Vianen, & van der Flier, 2007).
… Preemption of criticisms such as Moody’s (2009) is, however, readily accomplished through demonstration of transfer to several measures of an ability.
Unfortunately, the practice of equating posttest improvement on one task with change to cognitive abilities is prevalent within the WM training literature (cf. Jaeggi et al., 2008; Klingberg, 2010). This is partially driven by the time and monetary costs associated with conducting multisession, multiweek studies. Regardless, training studies can greatly improve the persuasiveness of their results by measuring transfer via several tasks that differ in peripheral aspects but converge on an ability of interest (e.g., a verbal, Gf, and spatial task from Figure 3). If a training effect is robust, it should be apparent in all tasks.
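The link between a factor loading and “variance explained” used in the discussion above is just a square. The following sketch back-calculates the loadings implied by the percentages cited in the text; these implied loadings are derived here for illustration and are not read off Figure 3, and the labels and function names are my own.

```python
import math


def variance_explained(loading: float) -> float:
    """Share of a task's variance attributable to the factor (loading squared)."""
    return loading ** 2


def implied_loading(share: float) -> float:
    """Loading that would produce a given share of explained variance."""
    return math.sqrt(share)


# Percentages cited in the text, expressed as proportions.
quoted_shares = {
    "WMC tasks, lowest": 0.46,
    "WMC tasks, highest": 0.70,
    "Ravens (Jensen, 1998)": 0.64,
    "Ravens (Kane et al., 2004)": 0.58,
}

for label, share in quoted_shares.items():
    loading = implied_loading(share)
    unexplained = 1 - share
    print(f"{label}: loading ~ {loading:.2f}, unexplained = {unexplained:.0%}")
```

For instance, Jensen’s 64% corresponds to a loading of 0.80, which leaves 36% of the Ravens variance to influences other than Gf. That residual is the room within which a training program could raise scores without changing the general ability.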
Redick et al. used an active control group, which they consider critical for elucidating the mechanisms responsible for transfer after any cognitive training. Specifically, the active control group received visual search training: “Because visual search performance is not likely to be determined by WM capacity, visual search training is unlikely to improve WM capacity. By comparing a visual search group that did not train WM to a dual n-back group that arguably did train WM, we can separate the potential transfer effects due to improving WM from those associated with placebo-type effects (Shipstead et al., 2010; Sternberg, 2008).” (p. 13). Across a total of 17 cognitive tasks (detailed on pp. 18-23), they found that “Out of 17 ANOVAs, there were no significant Group x Session interactions”. The results are presented as follows:
Given our emphasis on using multiple indicators of a construct instead of single tasks, we also examined transfer performance as a function of ability z-composites for the tasks representing fluid intelligence (separate spatial and nonspatial factors), multitasking, WM capacity, crystallized intelligence, and perceptual speed. … Overall, the z-composite analyses confirm the lack of transfer observed in the individual task analyses. Note that the marginal results for the gain from pre- to mid-test for the multitasking composite represent greater improvement for the no-contact control group relative to the other two groups (see Table 2).
To facilitate comparisons with previous research, we re-analyzed all transfer data excluding the visual search group. The results matched the full analyses – no transfer for the dual n-back group relative to the no-contact control group. In addition, we reanalyzed all transfer data excluding the mid-test session, comparing only the pre- and post-test sessions, and, again, we found no transfer for the dual n-back group relative to the no-contact or active control groups. Finally, we conducted ANCOVAs for each transfer task, using the pre-test score as the covariate. The results were qualitatively the same as the ANOVA results presented in Table 3. 
Variables that affect transfer
If WM training, and more specifically dual or single n-back training, can actually cause real improvements in fluid intelligence, then the diversity of transfer results across studies indicates that there are important boundary variables that can mediate or moderate training effectiveness. A recent study with children that trained on an adaptive single nback task identified the amount of n-back improvement as a critical variable determining whether or not transfer to intelligence was observed (Jaeggi, Buschkuehl, Jonides, & Shah, 2011). When Jaeggi et al. (2011) halved the n-back training group based on the amount of improvement observed, the children with the biggest gain showed transfer relative to an active-control group, whereas the children with smaller gains did not. We therefore attempted a similar analysis, by dividing our dual n-back subjects into high and low improvement groups, using a median split on the difference score of mean dual nback level in sessions 19-20 versus sessions 1-2. This post-hoc analysis is limited by sample size (only 12 and 12 subjects in the high- and low-improvement groups, respectively), but with that caveat in mind, no significant Group (high dual n-back improvement, low dual n-back improvement, no-contact control) effects were obtained for the fluid intelligence, multitasking, and WM composite standardized gains (Table 7). A similar median-split analysis for the visual search group (15 and 14 subjects in the high- and low-improvement groups, respectively) also produced no significant Group effects on the composite standardized gains (Table 7).
We also correlated the amount of training gain and transfer gain for the same four standardized gain composites (Table 7). Dual n-back improvement was not associated with fluid intelligence gains; it was marginally correlated with WM capacity improvement but, surprisingly, visual search improvement was also correlated with improvement on the verbal fluid intelligence tasks (Supplemental Materials, Figure S4). Other WM training studies (Chein & Morrison, 2010; Jaeggi et al., 2011) reporting significant correlations between training change and transfer change suffer from the same limitations as our data for such correlational analyses – small sample sizes and the influence of subjects who performed worse at post-test than pre-test on the transfer tasks (i.e., negative value for transfer gain) and performed worse at the end of training than the beginning of training on the training task (i.e., negative value for training gain). Indeed, in our data, the correlation between visual search change and verbal fluid intelligence change was no longer significant, r(28) = .25, p = .20, after removing the lone subject who had negative values on both variables.
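The two post-hoc checks described in the passage above, a median split on training gain and a gain-gain correlation that hinges on a single subject, can be illustrated with a short Python sketch. The numbers are invented for illustration and are not taken from Redick et al.; the helper names `pearson_r` and `median_split` are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_split(gains):
    """Return (low, high) lists of subject indices.

    Subjects at or below the median go to the low-improvement group.
    """
    cut = median(gains)
    low = [i for i, g in enumerate(gains) if g <= cut]
    high = [i for i, g in enumerate(gains) if g > cut]
    return low, high


# Invented data: one value per subject (NOT from any cited study).
# training_gain: late-session level minus early-session level.
# transfer_gain: post-test minus pre-test on a transfer task.
training_gain = [-1.5, 0.2, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2]
transfer_gain = [-2.0, 0.3, -0.1, 0.4, 0.0, 0.2, -0.2, 0.1]

# Median split into low- and high-improvement groups.
low, high = median_split(training_gain)

# Correlation over all subjects.
r_all = pearson_r(training_gain, transfer_gain)

# Drop subjects who got worse on BOTH measures, then recompute.
kept = [i for i in range(len(training_gain))
        if not (training_gain[i] < 0 and transfer_gain[i] < 0)]
r_trimmed = pearson_r([training_gain[i] for i in kept],
                      [transfer_gain[i] for i in kept])

print(f"low group: {low}, high group: {high}")
print(f"r with all subjects:       {r_all:.2f}")
print(f"r without the lone outlier: {r_trimmed:.2f}")
```

With these made-up values, the correlation is strongly positive when every subject is included and turns weakly negative once the single subject with two negative gains is removed, which is the kind of fragility the authors warn about in small samples.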
Earlier, Colom et al. (2010, pp. 5-7, Figures 1 & 2) found no evidence that score improvements on tests administered under challenging conditions and over a longer duration are correlated with g-loading: “(a) the rate of performance improvements in tasks requiring high degrees of effortful (complex) processing was much smaller across sessions, but still related to pre-test intelligence scores, whereas (b) the rate of performance improvements in tasks requiring very low levels of effortful (simple) processing remained high across sessions, but progressively least correlated with pre-test intelligence scores”. Once again, improvement in working memory is not related to g.
Fig. 1. summarizes average changes (d units) for STM, WMC, PS and ATT from session 1 to session 2, as well as from session 2 to session 3. There is a great reduction of improvements for STM and WMC in session 3, while improvements remain at high levels for PS and ATT. We computed confidence intervals for the values shown in Fig. 1: STM (S1–S2 = .47 and .57; S2–S3 = .09 and .21), WMC (S1–S2 = .27 and .36; S2–S3 = .13 and .22), Speed (S1–S2 = .53 and .57; S2–S3 = .42 and .46), Attention (S1–S2 = .71 and .73; S2–S3 = .56 and .62). Therefore, improvements for STM and WMC in the last session are very small, whereas for Speed and Attention these improvements are still large.
Finally, Fig. 2 depicts correlations between average intelligence in the pretesting session and performance across sessions for STM, WMC, PS, and ATT. Interestingly, these correlations do not change across sessions for STM and WMC, whereas they show a systematic decrease for PS and ATT (z values for PS and ATT were 2.91 and 2.93 respectively, p < .01).
Overall, the conclusion of these studies is in line with te Nijenhuis et al. (2007), who found a negative correlation between training-induced score gains and g-loading. One possible reason for this is that g is quite heritable (Hu, Sept.11.2013) and relatively stable (Beaver et al., 2013) during the developmental period.
3. First test of Jensen’s hypothesis: studies on repeated testing and g loadedness
In a classic study by Fleishman and Hempel (1955) as subjects were repeatedly given the same psychomotor tests, the g loading of the tests gradually decreased and each task’s specificity increased. Neubauer and Freudenthaler (1994) showed that after 9 h of practice the g loading of a modestly complex intelligence test dropped from .46 to .39. Te Nijenhuis, Voskuijl, and Schijve (2001) showed that after various forms of test preparation the g loadedness of their test battery decreased from .53 to .49. Based on the work of Ackerman (1986, 1987), it can be concluded that through practice on cognitive tasks part of the performance becomes overlearned and automatic; the performance requires less controlled processing of information, which is reflected in lowered g loadings.
The findings suggest that after training the g loadedness of the test decreased substantially. We found negative, substantial correlations between gain scores and RSPM total scores. Table 4 shows that the total score variance decreased after training, which is in line with low-g subjects increasing more than high-g subjects. Since, as a rule, high-g individuals profit the most from training – as is reflected in the ubiquitous positive correlation between IQ scores and training performance (Jensen, 1980; Schmidt & Hunter, 1998) – these findings could be interpreted as an indication that Feuerstein’s Mediated Learning Experience is not g-loaded, in contrast with regular trainings that are clearly g-loaded. Substantial, negative correlations between gain scores and RSPM total scores are no definite proof of this hypothesis, but are in line with it. Additional substantiation of our hypothesis that the Feuerstein training has no or little g loadedness is that Coyle (2006) showed that gain scores loaded virtually zero on the g factor. Moreover, Skuy et al. reported that the predictive validity of their measure did not increase when the second Raven score was used. The fact that individuals with low-g gained more than those with high-g could be interpreted as an indication that the Mediated Learning Experience was not g-loaded. It should be noted, however, that Feuerstein most likely did not intend his intervention to be g-loaded. He was interested in increasing the performance of low scorers on both tests and external criteria.
18. General discussion
The findings show that not the high-g participants increase their scores the most – as is common in training situations – but it is the low-g persons showing the largest increases of their scores. This suggests that the intervention training is not g-loaded.
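The pattern described in the quoted passage, a negative correlation between gain scores and baseline scores together with a shrinking score variance after training, can be made concrete with a small Python sketch. The scores below are invented for illustration and are not the RSPM data from te Nijenhuis et al.; `pearson_r` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented pre- and post-training scores, one pair per subject.
# By construction, the lowest scorers gain the most.
pre = [20, 25, 30, 35, 40, 45, 50]
post = [33, 34, 39, 40, 45, 46, 51]

# Gain score for each subject.
gain = [b - a for a, b in zip(pre, post)]

# Correlation between baseline score and gain.
r_gain_pre = pearson_r(pre, gain)

# Population variance of the total scores before and after training.
var_pre = pvariance(pre)
var_post = pvariance(post)

print(f"gains: {gain}")
print(f"r(pre, gain) = {r_gain_pre:.2f}")
print(f"variance before = {var_pre:.1f}, after = {var_post:.1f}")
```

When low scorers gain more than high scorers, as in this toy data set, the gain-baseline correlation comes out negative and the post-training variance is smaller than the pre-training variance, matching the two observations reported in the passage.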
This story is indeed too good to be true. It is important to explain why we should expect further failures and disappointments from this kind of experiment, even if the score gains had been shown to be durable, net of confounders (Shipstead et al., 2010, p. 251).
First, the gains and transfer effects must be maintained for years after the training. There is no reason to expect such durability, however. Even the most heavily subsidized, intensive educational interventions aimed at raising the IQ of the poorest American children have all failed to produce an enduring effect on IQ. Indeed, the IQ gains fade out with time (Herrnstein & Murray, 1994, pp. 405-406; Leak et al., 2010, pp. 8-10). As Hambrick (2012) pointed out: “We shouldn’t be surprised if extraordinary claims of quick gains in intelligence turn out to be wrong”. There is no reason why the easiest ways of improving IQ should succeed where the hardest ways have disappointed. The evidence now appears clear: Hulme & Melby-Lervåg (2012) and Shipstead et al. (2012b) report that the gains vanish within a few months and that those gains, specifically, seem to reflect mere test familiarity. A number of studies are methodologically flawed, and a series of replication failures is apparent.
Second, to be effective, the IQ gains derived from working memory training must improve performance on problem-solving tasks of the type that usually occur in real life. In other words, the increase in intelligence must have an impact in the real world. But why would an improvement in a specific task lead to an improvement in general ability? Murray (2005) answers this question with an analogy: “Suppose you have a friend who is a much better athlete than you, possessing better depth perception, hand-eye coordination, strength, and agility. Both of you try high-jumping for the first time, and your friend beats you. You practice for two weeks; your friend doesn’t. You have another contest and you beat your friend. But if tomorrow you were both to go out together and try tennis for the first time, your friend would beat you, just as your friend would beat you in high-jumping if he practiced as much as you did”. Previously, in their famous book “The Bell Curve” (1994), Herrnstein and Murray examined the Milwaukee Project, which aimed to improve the IQ of black children born to poor black mothers with IQs below 75. The project, which in effect coached the children on test-like material, failed to improve real intelligence. Soon after the project began, the children's IQ appeared to have improved by 25 points, a massive gain. By the age of 12 to 14 years, the gains had diminished, but the children's IQ remained 10 points higher than that of the control group. However, the key finding is that this substantial increase in IQ was not accompanied by increases in school performance compared to the control group. Therefore:
… Charles Locurto and Arthur Jensen have concluded that the program’s substantial and enduring gain in IQ has been produced by coaching the children so well on taking intelligence tests that their scores no longer measure intelligence or g very well.
According to Jensen (1998, p. 340), “It was also the most costly single experiment in the history of psychology and education — over $14 million. In terms of the highest peak of IQ gains for the seventeen children in the treatment condition (before the gains began to vanish), the cost was an estimated $23,000 per IQ point per child”. He then notes that “in subsequent testings on more advanced IQ tests during the period of decline after age six, those subtests whose contents least resembled the kinds of material on which the children had been trained (and tested) showed the greatest decline. These tests, evidently, showed the least transfer of training. It should be noted that the IQ tests’ item contents differ somewhat at each age level, so in each subsequent year after age six, when the children entered regular school, the contents of the subsequent IQ tests became less and less similar to the materials they had been trained on in the Stimulation Center. Therefore, as the transfer of training effect gradually diminished, the tests increasingly reflected the children’s true level of g” (p. 342).
This phenomenon has been well documented by te Nijenhuis et al. (2007), who established that IQ tests lose their g loadedness over time as a result of training, retesting, and familiarity. Needless to say, IQ does not reflect the ability to repeat the same task but the ability to solve a new problem. Intelligence tests are not ordinarily designed to withstand practice. It is not surprising, then, that cognitive training ‘improves’ IQ scores. Finally, the interesting fact is that if we were to take at face value the popular view that IQ scores reflect nothing more than the ability to do well on cognitive tests, then we should have seen substantial gains in IQ (g) after x training sessions. But this was not the case.
- Brehmer Yvonne, Westerberg Helena and Bäckman Lars, 2012, Working-memory training in younger and older adults: training gains, transfer, and maintenance.
- Chooi Weng Tink, 2011, Improving Intelligence by Increasing Working Memory Capacity.
- Chooi Weng Tink, and Thompson Lee A., 2012, Working memory training does not improve intelligence in healthy young adults.
- Colom Roberto, Quiroga Mª Ángeles, Shih Pei Chun, Martínez Kenia, Burgaleta Miguel, Martínez-Molina Agustín, Román Francisco J., Requena Laura, Ramírez Isabel, 2010, Improvement in working memory is not related to increased intelligence scores.
- Conway Andrew R.A., and Getz Sarah J., 2010, Cognitive Ability: Does Working Memory Training Enhance Intelligence?
- Fox Mark C., and Charness Neil, 2010, How to Gain Eleven IQ Points in Ten Minutes: Thinking Aloud Improves Raven’s Matrices Performance in Older Adults.
- Hambrick David Z., 2012, I.Q. Points for Sale, Cheap.
- Herrnstein Richard J., and Murray Charles, 1994, The Bell Curve.
- Hulme Charles, Melby-Lervåg Monica, 2012, Current evidence does not support the claims made for CogMed working memory training.
- Jaeggi Susanne M., Buschkuehl Martin, Jonides John, and Perrig Walter J., 2008, Improving fluid intelligence with training on working memory.
- Jaeggi Susanne M., Studer-Luethi Barbara, Buschkuehl Martin, Su Yi-Fen, Jonides John, Perrig Walter J., 2010, The relationship between n-back performance and matrix reasoning — implications for training and transfer.
- Jaeggi Susanne M., Buschkuehl Martin, Jonides John, and Shah Priti, 2011, Short- and long-term benefits of cognitive training.
- Jensen Arthur R., 1998, The g Factor: The Science of Mental Ability.
- Leak Jimmy, Duncan Greg J., Li Weilin, Magnuson Katherine, Schindler Holly, Yoshikawa Hirokazu, 2010, Is Timing Everything? How Early Childhood Education Program Impacts Vary by Starting Age, Program Duration and Time Since the End of the Program.
- Melby-Lervåg Monica, and Hulme Charles, 2012, Is Working Memory Training Effective? A Meta-Analytic Review.
- Moody David E., 2009, Can intelligence be increased by training on a task of working memory?
- Murray Charles, 2005, The Inequality Taboo.
- Redick Thomas S., Shipstead Zach, Harrison Tyler L., Hicks Kenny L., Fried David E., Hambrick David Z., Kane Michael J., Engle Randall W., 2012, No evidence of intelligence improvement after working memory training: A randomized, placebo-controlled study.
- Seidler Rachael D., Bernard Jessica A., Buschkuehl Martin, Jaeggi Susanne, Jonides John, Humfleet Jennifer, 2010, Cognitive Training as an Intervention to Improve Driving Ability in the Older Adult.
- Shipstead Zach, Redick Thomas S., and Engle Randall W., 2010, Does working memory training generalize?.
- Shipstead Zach, Redick Thomas S., and Engle Randall W., 2012a, Is Working Memory Training Effective?
- Shipstead Zach, Hicks Kenny L., Engle Randall W., 2012b, Cogmed working memory training: Does the evidence support the claims?
- Sternberg Robert J., 2008, Increasing fluid intelligence is possible after all.
- te Nijenhuis Jan, van Vianen Annelies E.M., van der Flier Henk, 2007, Score gains on g-loaded tests: No g.
- Thompson Todd W. et al., 2013, Failure of Working Memory Training to Enhance Cognition or Intelligence.
- Zinke Katharina, Zeintl Melanie, Eschen Anne, Herzog Carole, Kliegel Matthias, 2012, Potentials and Limits of Plasticity Induced by Working Memory Training in Old-Old Age.