By Emeritus Professor Geoff Cumming MAPS, Statistical Cognition Laboratory, School of Psychological Science, La Trobe University

Geoff Cumming was the recipient of the 2011 APS Distinguished Contribution to Psychology Education Award.

In April 2009, people rushed to Boots pharmacies in Britain to buy No. 7 Protect & Perfect Intense Beauty Serum. They were prompted by media reports of an article in the British Journal of Dermatology stating that the anti-ageing cream “produced statistically significant improvement in facial wrinkles as compared to baseline assessment (p = .013), whereas [placebo-treated] skin was not significantly improved (p = .11)”. The article claimed a statistically significant effect of the cream because p < .05, but no significant effect of the control placebo cream because p > .05. In other words, the cream had an effect, but the control material didn’t.

The errors of statistical significance testing

That’s a basic statistical blunder, to leap from placebo having ‘no significant effect, p > .05’ to ‘the effect is zero, and therefore the two conditions differ’. The researchers should have compared the two conditions directly, and not based their conclusion on finding one statistically significant and the other not.
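
To see why, here is a minimal sketch in Python using entirely invented summary statistics (not the study’s data): one condition can be ‘significant’ against baseline and the other not, even though a direct comparison between the two conditions shows no clear difference at all.

```python
# A minimal sketch (hypothetical numbers, not the study's data) of why the two
# conditions must be compared directly, rather than declaring one 'significant'
# and the other 'not significant'.
import numpy as np
from scipy import stats

def one_sample_p(mean, sd, n):
    """Two-sided p value for a one-sample t test of the mean against zero."""
    se = sd / np.sqrt(n)
    return 2 * stats.t.sf(abs(mean / se), df=n - 1)

# Hypothetical mean improvements from baseline (arbitrary units)
cream_mean, cream_sd, cream_n = 0.30, 0.9, 60
placebo_mean, placebo_sd, placebo_n = 0.18, 0.9, 60

print(f"Cream vs baseline:   p = {one_sample_p(cream_mean, cream_sd, cream_n):.3f}")        # about .01, 'significant'
print(f"Placebo vs baseline: p = {one_sample_p(placebo_mean, placebo_sd, placebo_n):.3f}")  # about .13, 'not significant'

# The correct analysis: estimate the difference between the two conditions
diff = cream_mean - placebo_mean
se_diff = np.sqrt(cream_sd**2 / cream_n + placebo_sd**2 / placebo_n)
t_crit = stats.t.ppf(0.975, cream_n + placebo_n - 2)
_, p_diff = stats.ttest_ind_from_stats(cream_mean, cream_sd, cream_n,
                                       placebo_mean, placebo_sd, placebo_n)
print(f"Cream minus placebo: {diff:.2f}, "
      f"95% CI [{diff - t_crit * se_diff:.2f}, {diff + t_crit * se_diff:.2f}], p = {p_diff:.2f}")
# The interval comfortably includes zero: these (invented) data give no clear
# evidence that the cream outperformed the placebo.
```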

Psychology and many other disciplines rely on null hypothesis significance testing (NHST) and the hallowed p < .05 to analyse data and reach conclusions. One fundamental problem is that, as in the example above, NHST encourages dichotomous thinking – the world is black or white, the null hypothesis is or is not rejected, an effect exists or it doesn’t. In reality, effects come in every shade of grey – from zero to tiny to small, and all the way up to huge. NHST misleads by imposing an arbitrary cut-off to divide all the greys into only two categories.

It’s amusing to think of crowds chasing dreams of eternal youth, urged on by tabloid reporting of conclusions based on statistical error, but NHST is no laughing matter: it has severely damaging effects on research.

  • Read the best summary of the problems with NHST in Chapter 3 of a book by Rex Kline (2004) [can be accessed at: tinyurl.com/klinechap3]

Estimation, the better way

Fortunately, there is a much better way, and it’s already familiar to just about everyone. If a newspaper reports that “support for the Prime Minister is 37 per cent in a poll with an error margin of two per cent”, most people understand: the 37 per cent is our best estimate of support in the whole population and, most likely, it’s within two per cent of the true value.

Reporting something like 37 ± 2 is a good way to answer many of science’s questions. A chemist reports a melting point as 17 ± 0.2 degrees, and a geologist estimates the age of the Earth as 4.5 ± 0.1 billion years. These are examples of estimation, a statistical strategy widely used in the natural sciences, and in engineering and other applied fields.

The 37 ± 2 defines a range, from 35 to 39 per cent, which, most likely, includes the true value. This is the ‘95% confidence interval’ – we can be 95 per cent confident that the interval [35, 39] calculated from the poll results includes the true value of support for the Prime Minister.
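
For readers who want to see the arithmetic, here is a brief sketch of where a two per cent margin of error comes from, assuming a simple random sample; the sample size below is invented purely for illustration.

```python
# A rough sketch of where a poll's '37 +/- 2 per cent' comes from, assuming a
# simple random sample. The sample size of 2,240 respondents is invented.
import math

p_hat = 0.37   # observed proportion supporting the Prime Minister
n = 2240       # hypothetical number of respondents
z = 1.96       # critical value for a 95% confidence interval

se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
margin = z * se                           # margin of error

print(f"Estimate: {100 * p_hat:.0f}%, margin of error: {100 * margin:.1f} percentage points")
print(f"95% CI: [{100 * (p_hat - margin):.1f}, {100 * (p_hat + margin):.1f}]")
# -> roughly 37 +/- 2, that is, the interval [35, 39]
```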

Estimation is highly informative – it tells us what we want to know – and so simple you can report it in a newspaper. It is the way of the future, and is here right now (Cumming, 2012). Why then does psychology continue to cling to NHST? Perhaps because it seems to offer a seductive certainty – declaring a result ‘significant’ seems close to declaring that it is true, large and important. Unfortunately, NHST offers only illusory certainty and says nothing about the size of the effect.

  • An introduction to understanding and using confidence intervals can be found at: tinyurl.com/inferencebyeye.

The dance of the p values

Another NHST problem relates to replication. Replication is central in science – usually, we won’t take any result seriously until it has been replicated a couple of times. An advantage of estimation is that a confidence interval tells us what’s likely to happen on replication. If we ran another poll – the same size, but asking a different sample of people – we’d most likely get a result within the 37 ± 2 confidence interval given by our first poll.

Not so for significance testing! p values are usually calculated to two or even three decimal places, and decisions about significance are based on the precise value. However, a replication experiment is likely to give a very different p value. Significance testing gives almost no information about what’s likely to happen on replication! Few researchers appreciate this problem, which totally undermines any belief – or desperate hope – that significance is a reliable guide to truth.

  • For a simulation of how p values jump around wildly with replication, watch the dance of the p values at: tinyurl.com/danceptrial2
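
Along the same lines as that simulation, here is a rough sketch of the dance: the same two-group experiment is run again and again on new samples from the same populations, and the p value is recorded each time. The sample size, effect size and number of replications below are illustrative choices, not taken from any particular study.

```python
# A small simulation of the 'dance of the p values': the same experiment,
# repeated on fresh samples, with the population effect held fixed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 32             # participants per group
true_effect = 0.5  # population effect size (Cohen's d); roughly 50% power here

for i in range(1, 26):                    # 25 'replications' of the experiment
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_effect, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    flag = "*" if p < .05 else " "
    print(f"Replication {i:2d}: p = {p:.3f} {flag}")
# Although every replication samples from exactly the same populations, the
# p values typically spread from below .01 to well above .05, and only some
# of them reach 'significance'.
```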

My conclusion is that significance testing gives only a seductive illusion of certainty and is actually extremely unreliable. All round, it’s a terrible idea.

What’s needed for evidence-based practice

Cathy Faulkner investigated what clinical psychologists need to support evidence-based practice (Faulkner, Fidler, & Cumming, 2008). She first asked, by email, a group of leading clinical researchers: “Think of a clinical trial that you designed. What was the most central question(s) it was designed to answer?” Fully 81 per cent replied: “Is there an effect?” Then she asked them to rate the importance of three possible questions: (1) Is there an effect? (2) How large is the effect? (3) How clinically important is the effect? Given those prompts, her expert respondents rated all three as highly important. In other words, their first response (Is there an effect?) reflected their automatic dichotomous thinking, no doubt reinforced by a lifetime of NHST. But, when prompted, they immediately recognised that a trial of a psychological therapy is useful only if it tells us how large an effect the therapy is likely to give, and how clinically important that is. So estimation, meaning confidence intervals, is what we need: it gives the fullest information about the size of an effect, and the best basis for assessing its clinical importance.

Cathy also examined all 104 reports of randomised controlled trials (RCTs) of psychological therapies that were published during 1999-2003 in two leading psychology journals – I suspect little has changed since then. Fully 99 per cent used NHST, only five per cent reported confidence intervals, and only 78 per cent even mentioned clinical importance. Such RCTs should be providing the core evidence that clinical psychologists need to shape their evidence-based practice, but they are not doing their job! They rely on NHST and don’t use estimation, so their value for professional practice is compromised.

Meta-analysis to build evidence

Researchers may be reluctant to report confidence intervals because they are often disappointingly long. They are long because people vary so much, and it’s often not practical – or possible – to use sufficiently large samples to get short confidence intervals. So, what do we do? An excellent approach is to combine results from multiple studies to get better estimates – and meta-analysis does exactly that. Meta-analysis integrates evidence over studies to give an overall estimate of the size of the effect of interest, and a confidence interval to tell us how precise that estimate is and how consistent the results from the different studies are.

Meta-analysis is based on estimation. It is becoming widely recognised as the best way to review a research literature and combine evidence over studies to provide the best evidence to underpin evidence-based practice. Of course, meta-analysis can only give an accurate result if all relevant studies are included. Here, alas, significance testing does further damage, because statistical significance has often influenced which studies are published – non-significant results are likely to languish unpublished in file drawers, biasing future meta-analyses by their absence. Switching from NHST to estimation not only gives fuller information about our research, but also helps ensure that future meta-analyses avoid bias by including all relevant studies.
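
As a sketch of the basic machinery, here is a minimal fixed-effect, inverse-variance meta-analysis in Python; the three ‘studies’ are made-up numbers, and a real meta-analysis would also examine how consistent the studies are, typically with heterogeneity statistics or a random-effects model.

```python
# A minimal sketch of fixed-effect, inverse-variance meta-analysis: each study
# contributes an effect estimate and its standard error, and studies are
# weighted by the precision of their estimates. The numbers are invented.
import math

# (effect estimate, standard error) for each hypothetical study
studies = [(0.42, 0.21), (0.25, 0.15), (0.55, 0.30)]

weights = [1 / se ** 2 for _, se in studies]         # inverse-variance weights
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

print(f"Pooled effect: {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# The pooled interval is shorter than any single study's interval, because the
# combined evidence gives a more precise estimate.
```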

The Cochrane Collaboration is a large online database of research reviews that use meta-analysis to summarise evidence on thousands of important topics in medicine and the health sciences. It is a wonderful achievement and the world’s primary resource for supporting evidence-based practice in medicine. Some psychology is included, but our discipline has yet to develop a comparable resource to support our own evidence-based practice.

  • To explore the Cochrane Collaboration, go to www.cochrane.org and browse, or scroll to the very bottom of the home page and click on ‘The Cochrane Library (Full-text)’. Search for ‘cognitive’ or ‘behavioural’ to find reports of psychological interest.

The new statistics

I refer to estimation and meta-analysis as ‘the new statistics’, not because the techniques are new, but because, for most researchers who currently rely on NHST, using the techniques would be very new – and would require big changes in attitude. But switching to estimation could give great improvements to research, and to the value of that research for practitioners (Cumming, 2012).

  • Radio National broadcast a talk of mine that gives a brief explanation of the new statistics and why we need them. The podcast and transcript are at: tinyurl.com/geofftalk

The latest edition of the Publication Manual of the American Psychological Association (APA, 2010) states unequivocally that interpretation of results should, wherever possible, be based on estimation. This crucial advice is new, and I hope gives a great boost to adoption of the new statistics – which are the way forward for all psychologists.

The author can be contacted at g.cumming@latrobe.edu.au

References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.

Faulkner, C., Fidler, F., & Cumming, G. (2008). The value of RCT evidence depends on the quality of statistical analysis. Behaviour Research and Therapy, 46, 270-281.

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: APA Books.

In Psych June 2012