Science has gone awry

This was a recent article published in The Economist. Basically, I am not at all surprised by scientific misconduct and bias; this is why I don't generally trust scientists. Anyway:

Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan. …

It is tempting to see the priming fracas as an isolated case in an area of science — psychology — easily marginalised as soft and wayward. But irreproducibility is much more widespread. A few years ago scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. According to a piece they wrote last year in Nature, a leading scientific journal, they were able to reproduce the original results in just six. Months earlier Florian Prinz and his colleagues at Bayer HealthCare, a German pharmaceutical giant, reported in Nature Reviews Drug Discovery, a sister journal, that they had successfully reproduced the published results in just a quarter of 67 seminal studies.

But the first sentence of the following paragraph sounds unlikely to me:

Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Because to acknowledge this publicly is to lose one's credibility as a professor and as a scientist. More likely than not, they will admit to being wrong, but not publicly; they will wait for the storm to pass, i.e., for others to forget about it.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems. “There is no cost to getting things wrong,” says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline’s persistent errors. “The cost is not getting them published.”

Statistical mistakes and misuse are a problem when they increase the variability of results between studies, on top of the fact that studies often use different procedures. This leads people not familiar with statistics (i.e., laypeople) to the false conclusion that a certain field of research does not yield promising results, when in fact the inconsistency is due to poor methodology.

Unlikeliness is a measure of how surprising the result might be. By and large, scientists want surprising results, and so they test hypotheses that are normally pretty unlikely and often very unlikely. Dr Ioannidis argues that in his field, epidemiology, you might expect one in ten hypotheses to be true. In exploratory disciplines like genomics, which rely on combing through vast troves of data about genes and proteins for interesting relationships, you might expect just one in a thousand to prove correct.

With this in mind, consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5% — that is, 45 of them — will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.

The negative results are much more trustworthy; for the case where the power is 0.8 there are 875 negative results of which only 20 are false, giving an accuracy of over 97%. But researchers and the journals in which they publish are not very interested in negative results. They prefer to accentuate the positive, and thus the error-prone. Negative results account for just 10-30% of published scientific literature, depending on the discipline. This bias may be growing. A study of 4,600 papers from across the sciences conducted by Daniele Fanelli of the University of Edinburgh found that the proportion of negative results dropped from 30% to 14% between 1990 and 2007. Lesley Yellowlees, president of Britain’s Royal Society of Chemistry, has published more than 100 papers. She remembers only one that reported a negative result.

Statisticians have ways to deal with such problems. But most scientists are not statisticians. Victoria Stodden, a statistician at Columbia, speaks for many in her trade when she says that scientists’ grasp of statistics has not kept pace with the development of complex mathematical techniques for crunching data. Some scientists use inappropriate techniques because those are the ones they feel comfortable with; others latch on to new ones without understanding their subtleties. Some just rely on the methods built into their software, even if they don’t understand them.
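To make the arithmetic in the quoted passage explicit (1,000 hypotheses, 100 of them true, a 5% significance level, and two levels of statistical power), here is a minimal sketch in Python; the code is my own addition, not the article's:

```python
# Reproduce the arithmetic from the quoted passage: 1,000 hypotheses,
# 100 of them true, tested at alpha = 0.05 with a given statistical power.
def screen(n_hypotheses=1000, n_true=100, alpha=0.05, power=0.8):
    true_positives = power * n_true                     # true effects detected
    false_negatives = n_true - true_positives           # true effects missed
    false_positives = alpha * (n_hypotheses - n_true)   # type I errors among the 900 false hypotheses
    true_negatives = (n_hypotheses - n_true) - false_positives
    ppv = true_positives / (true_positives + false_positives)   # share of positive results that are real
    npv = true_negatives / (true_negatives + false_negatives)   # share of negative results that are real
    return true_positives, false_positives, ppv, npv

for power in (0.8, 0.4):
    tp, fp, ppv, npv = screen(power=power)
    print(f"power={power}: {tp:.0f} true positives, {fp:.0f} false positives, "
          f"PPV={ppv:.2f}, NPV={npv:.2f}")
# power=0.8: 80 true positives, 45 false positives, PPV=0.64, NPV=0.98
# power=0.4: 40 true positives, 45 false positives, PPV=0.47, NPV=0.93
```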

Statistical error is something I have heard about before. I myself may have made some mistakes when I started to play with data, e.g., by forgetting to check the normality of the distribution or the presence of outliers in my variables. But thinking back on it, I noticed that in most papers I have read so far there is no mention in the method section of checking normality, either of the variables or of the residuals (in the case of regression analyses), let alone of outliers. This makes me believe that these factors may simply have been neglected by the authors. That's a serious problem. Also annoying is the common use of inappropriate methods (see Erceg-Hurn & Mirosevich, 2008), partly because scientists may not be aware of newly proposed, better methods.
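As an illustration of the kind of checks I have in mind, here is a minimal sketch with simulated data; the Shapiro-Wilk test and an interquartile-range screen are only one possible choice among many:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(100, 15, size=200)   # simulated variable standing in for real data
x[:3] = [300, 310, 320]             # a few implausible values, e.g. miscoded entries

# Normality check (for a regression, the same test would be applied to the
# residuals rather than to the raw variable).
w, p = stats.shapiro(x)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.4f}")   # a small p suggests non-normality

# Simple outlier screen based on the interquartile range.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(f"{outliers.size} value(s) flagged as potential outliers")
```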

The last sentence in the quoted paragraph makes me think about something. Software such as Stata or SPSS works through a specific syntax for running correlations, regressions, ANOVAs, factor analyses, and so forth. I remember well that at the beginning I had a lot of problems with SPSS syntax. I don't know about scientists, but just imagine that they have typed the wrong syntax. Worse, when researchers analyze survey data, say, the NLSY, ECLS, or Add Health, one wonders why they do not tell us which variables they used, or which sampling weight they applied (if they applied one at all). Simply naming the variable label does not require much effort. But somehow I can kind of understand: if they picked the wrong one and disclosed it, that would be the end. And this is not unlikely, because some survey data sets have multiple related variables scattered everywhere, some of which (sadly) must be collapsed into a single one. To choose the correct one, we need to read the codebooks (note the "s" in that word), which may total more than 1,000 pages. This is time-consuming and really exhausting. Sometimes the data set is a real mess, and the variables do not always exclude missing answers (i.e., missing-value codes), so we need to correct this manually with the appropriate syntax, otherwise the analysis is messed up. It is regrettable that the method section is generally so obscure about this; it is impossible to guess how the authors dealt with the data. The less information researchers give, the less likely someone else will discover the origins of the flaws. This is rather bothersome when one team tries to replicate another and neither mentions the variables used or the procedure.
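To show what I mean by correcting missing values and using the sampling weight, here is a minimal sketch; the column names and the negative missing-value codes are invented for illustration and do not come from any actual survey:

```python
import pandas as pd

# Toy rows standing in for a survey extract; the column names and the
# negative missing-value codes are hypothetical.
df = pd.DataFrame({
    "TEST_SCORE":    [52, 61, -8, 48, -9, 70],        # -8/-9 = refused / invalid skip
    "SAMPLE_WEIGHT": [1200, 950, 1100, 1300, 800, 1000],
})

# Recode the missing-value codes to actual missing values before analysis,
# otherwise they would be treated as real (and absurd) scores.
df["TEST_SCORE"] = df["TEST_SCORE"].where(df["TEST_SCORE"] >= 0)

# Weighted mean using the sampling weight, dropping the missing rows.
valid = df.dropna(subset=["TEST_SCORE"])
weighted_mean = (valid["TEST_SCORE"] * valid["SAMPLE_WEIGHT"]).sum() / valid["SAMPLE_WEIGHT"].sum()
print(f"Weighted mean score: {weighted_mean:.1f}")
```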

I have tried a few times to request syntax from authors or to ask which variables they used, but my success rate so far is 0%, and I do not expect it to increase later. Asking for data is generally even worse. This is obvious: the more data we share, the more likely we are to be countered. As for myself, although I am not a scientist, and hence not comparable at all, I always share my Excel spreadsheets and syntax. I must be stupid. The sad thing is that I will continue.

This fits with another line of evidence suggesting that a lot of scientific research is poorly thought through, or executed, or both. The peer-reviewers at a journal like Nature provide editors with opinions on a paper’s novelty and significance as well as its shortcomings. But some new journals — PLoS One, published by the not-for-profit Public Library of Science, was the pioneer — make a point of being less picky. These “minimal-threshold” journals, which are online-only, seek to publish as much science as possible, rather than to pick out the best. They thus ask their peer reviewers only if a paper is methodologically sound. Remarkably, almost half the submissions to PLoS One are rejected for failing to clear that seemingly low bar.

PLoS ONE is a journal from which I pick a lot of studies, since I come across many PLoS ONE papers. But what I believed to be a high-quality journal is very far from the reality. Even more problematic is the apparently growing suspicion of fraud.

The number of retractions has grown tenfold over the past decade. But they still make up no more than 0.2% of the 1.4m papers published annually in scholarly journals. Papers with fundamental flaws often live on. Some may develop a bad reputation among those in the know, who will warn colleagues. But to outsiders they will appear part of the scientific canon.

The following paragraph, however, is more annoying and surprised me a little, since I consider that a peer-reviewed journal has a duty to check articles carefully and to provide a severe critique. I had imagined a jungle inhabited by ferocious beasts. But here is the reality:

The idea that there are a lot of uncorrected flaws in published studies may seem hard to square with the fact that almost all of them will have been through peer-review. This sort of scrutiny by disinterested experts — acting out of a sense of professional obligation, rather than for pay — is often said to make the scientific literature particularly reliable. In practice it is poor at detecting many types of error.

John Bohannon, a biologist at Harvard, recently submitted a pseudonymous paper on the effects of a chemical derived from lichen on cancer cells to 304 journals describing themselves as using peer review. An unusual move; but it was an unusual paper, concocted wholesale and stuffed with clangers in study design, analysis and interpretation of results. Receiving this dog’s dinner from a fictitious researcher at a made up university, 157 of the journals accepted it for publication.

Dr Bohannon’s sting was directed at the lower tier of academic journals. But in a classic 1998 study Fiona Godlee, editor of the prestigious British Medical Journal, sent an article containing eight deliberate mistakes in study design, analysis and interpretation to more than 200 of the BMJ’s regular reviewers. Not one picked out all the mistakes. On average, they reported fewer than two; some did not spot any.

Another experiment at the BMJ showed that reviewers did no better when more clearly instructed on the problems they might encounter. They also seem to get worse with experience. Charles McCulloch and Michael Callaham, of the University of California, San Francisco, looked at how 1,500 referees were rated by editors at leading journals over a 14-year period and found that 92% showed a slow but steady drop in their scores.

As well as not spotting things they ought to spot, there is a lot that peer reviewers do not even try to check. They do not typically re-analyse the data presented from scratch, contenting themselves with a sense that the authors’ analysis is properly conceived. And they cannot be expected to spot deliberate falsifications if they are carried out with a modicum of subtlety.

Fraud is very likely second to incompetence in generating erroneous results, though it is hard to tell for certain. Dr Fanelli has looked at 21 different surveys of academics (mostly in the biomedical sciences but also in civil engineering, chemistry and economics) carried out between 1987 and 2008. Only 2% of respondents admitted falsifying or fabricating data, but 28% of respondents claimed to know of colleagues who engaged in questionable research practices.

Peer review’s multiple failings would matter less if science’s self-correction mechanism — replication — was in working order. Sometimes replications make a difference and even hit the headlines — as in the case of Thomas Herndon, a graduate student at the University of Massachusetts. He tried to replicate results on growth and austerity by two economists, Carmen Reinhart and Kenneth Rogoff, and found that their paper contained various errors, including one in the use of a spreadsheet.

I used to believe that reviewers were much pickier than this. Again, this seems to depart seriously from reality. On reflection, however, it may not be surprising, as I have heard here and there that scientists are always busy (teaching courses, attending conferences). Given this, it is likely that most reviewers do not examine papers with much scrutiny. And yet this is no excuse for not having noticed that the use of a no-contact control group, in experiments studying the effect of working-memory training on general intelligence, upwardly biases the effect size, an instance of what is known in this field as the Hawthorne (placebo) effect. Every scientist should know this.

I have even spotted some papers that misreport their own numbers. For instance, the text says there is a relationship between x and y and gives a number, but the table reports otherwise, or the authors simply make a claim opposite to what is presented in their own tables. This is rather confusing, and it is especially annoying when several people were working together and no one detected the error. It makes me wonder whether some of the numbers entered into the analyses could themselves have been misreported. I noted that in this article recently: a slight modification of the numbers in one column vector can greatly affect the magnitude of the correlation, at least when n is small. And obviously, the conclusions drawn from such analyses will be biased by misreported numbers.
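Here is a minimal sketch of that point, with made-up numbers; a single transcription error in a column of ten values is enough to shift the Pearson correlation considerably:

```python
import numpy as np

# Hypothetical small-sample data: two variables measured on ten cases.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9], dtype=float)

r_correct = np.corrcoef(x, y)[0, 1]

# Suppose the last value, 9, was misreported as 1 when the table was typed up.
y_typo = y.copy()
y_typo[-1] = 1.0
r_typo = np.corrcoef(x, y_typo)[0, 1]

print(f"r with the correct numbers: {r_correct:.2f}")   # about 0.95
print(f"r with one misreported number: {r_typo:.2f}")   # about 0.50
```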

With regard to fraud, I believe, at least on the topic of intelligence, genetics and race, that studies supporting hereditarianism, or what some people call racialism, are costly for their authors, and even more so when those studies are fraudulent. A study purporting to undermine the hereditarian argument through misconduct and falsification, in my opinion, has much less to fear from public opinion, including that of peers, by comparison.

Anyway, the way this article (The Economist's, not mine) is useful is in reminding naive people not to rely too much on a single study. I see this kind of thing a million times, usually on blogs and forums: people who do not care to provide a list of studies, or at least a review. Meta-analysis, because it is based on a theory of data (Hunter & Schmidt, 2004, p. 30), is a good tool in that it helps us better understand the variability between studies, and hence how an effect size can be maximized (e.g., through the detection of moderators). But this requires looking beyond the level of a single study.
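As a minimal sketch of what pooling across studies looks like, here is the simplest inverse-variance (fixed-effect) version, with invented effect sizes and standard errors; this is not Hunter and Schmidt's method, which weights by sample size and corrects for statistical artifacts:

```python
import numpy as np

# Five hypothetical studies: effect sizes and their standard errors (invented).
d  = np.array([0.30, 0.10, 0.45, 0.25, 0.05])
se = np.array([0.12, 0.15, 0.20, 0.10, 0.18])

w = 1.0 / se**2                          # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)     # pooled (weighted mean) effect size
q = np.sum(w * (d - d_pooled)**2)        # Cochran's Q, a heterogeneity statistic

print(f"Pooled d = {d_pooled:.2f}, Q = {q:.1f} on {d.size - 1} df")
```

Even this bare version makes the point: the unit of analysis is the set of studies, not any single one.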
