Effect sizes

Effect sizes are a popular way of communicating research findings. They can move beyond binary discussions about whether something ‘works’ or not and illuminate the magnitude of differences.

Famous examples of effect sizes include:

  • The Teaching and Learning Toolkit’s months’ additional progress
  • Hattie’s dials and supposed ‘hinge point’ of 0.4

Like anything, effect sizes can be used more or less effectively. Considering these four questions will support intelligent use.

What type of effect size is it?

There are two fundamentally different uses of effect sizes. One communicates information about an association; the other focuses on interventions. Confusing the two leads to the classic statistical mistake of confusing correlation with causation.

Understanding the strength of associations, or correlations, is important. It is often the first step to learning more about phenomena. For instance, knowing that there is a strong association between parental engagement and educational achievement is illuminating. However, this association is very different from the causal claim that improving parental engagement can improve school achievement (See & Gorard, 2013). Causal effect sizes are more common in education; we will focus on them with the remaining questions.
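
To make the distinction concrete, here is a minimal Python sketch of the two kinds of effect size. It assumes a Pearson correlation for the association and a standardised mean difference (Cohen's d with a pooled standard deviation) for the intervention. The function names and all the scores are invented for illustration and come from no real study.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(xs, ys):
    """Association: how strongly two measures move together (-1 to 1)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def standardised_mean_difference(treated, control):
    """Intervention: difference in group means, in pooled-SD units (Cohen's d)."""
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * variance(treated) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Invented data: parental engagement (hours) and attainment for five pupils.
engagement = [1, 2, 3, 4, 5]
attainment = [52, 55, 61, 60, 67]
print(round(pearson_r(engagement, attainment), 2))

# Invented data: test scores for an intervention group and a control group.
treated = [2, 4, 6]
control = [1, 3, 5]
print(standardised_mean_difference(treated, control))  # 0.5
```

Both numbers are 'effect sizes', but only the second compares groups, and even then a causal reading depends on how the groups were formed.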

How did the overall study influence the effect size?

It is tempting to think that effect sizes tell us something absolute about a specific intervention. They do not. A better way to think of effect sizes is as properties of the entire study. This does not make effect sizes useless, but making sense of them takes more judgement than it may first appear.

Let’s look at the effect sizes from three EEF-funded trials (Dimova et al., 2020; Speckesser et al., 2018; Torgerson et al., 2014):

All these programmes seem compelling, and Using Self-Regulation to Improve Writing appears the best. These are the two obvious – and I think incorrect – conclusions that we might draw. These studies helpfully illustrate the importance of looking at the whole study when deciding the meaning of any effect size.

1. Some outcomes are easier to improve than others. 

The more closely aligned an outcome is to the intervention, the bigger the effects we can expect (Slavin & Madden, 2011). So we would expect a programme focusing on algebra to report larger effects for algebra than for mathematics overall. This is critical to know when appraising outcomes that have been designed by the developers of interventions. In extreme cases, assessments may focus on topics that only the intervention group have been taught!

There’s also reason to think that some subjects may be easier to improve than others. For instance, writing interventions tend to report huge effects (Graham, McKeown, Kiuhara, & Harris, 2012). Is there something about writing that makes it easier to improve?

2. If the pupils are very similar, the effects are larger.

To illuminate one reason, consider that around 13 per cent of children in the UK have undiagnosed vision difficulties (Thurston, 2014). Only those children with vision difficulties can possibly benefit from any intervention to provide glasses. If you wanted to show your intervention was effective, you would do everything possible to ensure that only children who could benefit were included in the study. Other pupils dilute the benefits.
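
The dilution can be sketched with some rough arithmetic. The Python below assumes, as a deliberate simplification, that pupils who cannot benefit gain nothing and that the spread of scores is unaffected by who is included. The effect of 0.5 among pupils who need glasses is an invented figure; only the 13 per cent comes from the text above.

```python
def diluted_effect(effect_among_beneficiaries, share_who_can_benefit):
    """Average effect across all pupils when only some of them can benefit.

    Simplifying assumption: everyone else has an effect of exactly zero.
    """
    if not 0 <= share_who_can_benefit <= 1:
        raise ValueError("share_who_can_benefit must be between 0 and 1")
    return effect_among_beneficiaries * share_who_can_benefit


# Hypothetical effect of 0.5 for pupils with undiagnosed vision difficulties.
whole_cohort = diluted_effect(0.5, 0.13)   # every pupil included in the study
targeted = diluted_effect(0.5, 1.0)        # only pupils who can benefit

print(round(whole_cohort, 3))  # 0.065
print(targeted)                # 0.5
```

Under these assumptions, the same intervention looks almost eight times more effective in the targeted study, even though nothing about the intervention has changed.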

3. Effects tend to be larger with younger children.  

Young children tend to make rapid gains in their learning. I find it extraordinary how quickly young children learn to read, for example. 

A more subtle interpretation I’ve heard Professor Peter Tymms advocate is to think about how deep into a subject pupils have reached. This may explain the large effects in writing interventions. In my experience, the teaching of writing is typically much less systematic than the teaching of reading. Perhaps many pupils are simply not very deep into learning to write, and so make rapid early gains when writing is given more focus.

4. More rigorous evaluations produce smaller effects.

A review of over 600 effect sizes found that random allocation to treatment conditions is associated with smaller effects (Cheung & Slavin, 2016). Effects also tend to be smaller when action is taken to reduce bias, like the use of independent evaluations (Wolf, Morrison, Inns, Slavin, & Risman, 2020). This is probably why most EEF-funded trials – with their exacting standards (EEF, 2017) – find smaller effects than the earlier research summarised in the Teaching and Learning Toolkit.

5. Scale matters.

A frustrating finding in many research fields is that as programmes get larger, effects get smaller. One likely reason is fidelity. A fantastic music teacher who has laboured to create a new intervention is likely much better at delivering it than her colleagues. Even if she trained her colleagues, they would likely remain less skilled and less motivated to make it work. Our music teacher is an example of the super-realisation bias that can distort small-scale research studies.

Returning to our three EEF-funded studies, it becomes clear that our initial assumption that Using Self-Regulation to Improve Writing (the IPEELL programme) was the most promising may be wrong. My attempt at calibrating each study against the five issues is shown below. The green arrows indicate we should consider mentally ‘raising’ the effect size. In contrast, the red arrows suggest ‘lowering’ the reported effect sizes.

This mental recalibration is imprecise, but accepting the uncertainty may be useful.

How meaningful is the difference?

Education is awash with wild claims. Lots of organisations promise their work is transformational. Perhaps it is, but the findings from rigorous evaluations suggest that most things do not make much difference. A striking fact is that just a quarter of EEF-funded trials report a positive impact.

Historically, some researchers have sought to give benchmarks to guide interpretations of studies. Although they are alluring, they’re not very helpful. A famous example is Hattie’s ‘hinge point’ of 0.4, which was the average from his Visible Learning project (Hattie, 2008). However, the low quality of the included studies inflates the average; the contrast with the more modest effect sizes from rigorous evaluations is clear-cut. If anything, the hinge point highlights the absurdity of trying to compare effect sizes with universal benchmarks.

The graphic below presents multiple representations of the difference found in the Nuffield Early Language Intervention (+3 months’ additional progress) between the intervention and control groups. I created it using this fantastic resource. I recommend using it as the multiple representations and interactive format help develop a more intuitive feeling for effect sizes.
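
For readers who prefer numbers to pictures, a few common re-expressions of a standardised mean difference can be sketched in Python. This is a minimal sketch that assumes both groups are normally distributed with the same spread. The function name is hypothetical, the values of d in the loop are arbitrary rather than the Nuffield Early Language Intervention result, and the resource mentioned above may calculate its figures differently.

```python
from math import sqrt
from statistics import NormalDist


def representations(d):
    """Re-express a standardised mean difference d in three other ways.

    Assumes normal distributions with equal standard deviations.
    """
    z = NormalDist()  # standard normal distribution
    return {
        # Percentile of the control group reached by the average
        # intervention pupil (sometimes called Cohen's U3).
        "percentile_of_average_treated_pupil": z.cdf(d) * 100,
        # Chance that a randomly chosen intervention pupil outscores
        # a randomly chosen control pupil.
        "probability_of_superiority": z.cdf(d / sqrt(2)),
        # Share of the two distributions that overlaps (1.0 = identical).
        "overlap": 2 * z.cdf(-abs(d) / 2),
    }


for d in (0.0, 0.1, 0.3, 0.8):  # arbitrary illustrative values
    print(d, {name: round(value, 3) for name, value in representations(d).items()})
```

Even a difference that sounds worthwhile in standard-deviation units leaves the two groups overlapping heavily, which is a useful check on claims of transformation.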

How cost-effective is it?

Thinking about cost often changes what looks like the best bets. Cheap, low-impact initiatives may be more cost-effective than higher-impact but more intensive projects. An excellent example is the low impact and ultra-low cost of texting parents about their children’s learning (Miller et al., 2016).

It is also vital to think through different definitions of cost. In school, time is often the most precious resource.
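
One crude way to put cost and impact side by side is to express the effect per £100 spent per pupil. The Python sketch below does this; the function name, effect sizes and costs are all hypothetical and not taken from any of the studies cited here. A simple ratio like this also assumes that spending more would buy proportionally more impact, which is doubtful, and it ignores staff time unless that is priced in.

```python
def effect_per_100_pounds(effect_size, cost_per_pupil):
    """Effect size bought per £100 spent on each pupil (a crude ratio)."""
    if cost_per_pupil <= 0:
        raise ValueError("cost_per_pupil must be positive")
    return effect_size / cost_per_pupil * 100


# Hypothetical programmes: (effect size, cost per pupil in pounds).
programmes = {
    "cheap, low-impact": (0.03, 5),
    "intensive, higher-impact": (0.20, 500),
}

for name, (effect, cost) in programmes.items():
    print(name, round(effect_per_100_pounds(effect, cost), 2))
```

With these invented figures, the cheap programme delivers far more effect per pound, although a school that needs a large improvement for a small group of pupils may still reasonably choose the intensive one.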

In summary

Effect sizes are imperfect but, used well, they have much to offer. Remember to ask:

  • What type of effect size is it?
  • How did the overall study influence the effect size?
  • How meaningful is the difference?
  • How cost-effective is it?

Next steps for further reading

Kraft, M. A. (2020). Interpreting Effect Sizes of Education Interventions. Educational Researcher.

Piper, K. (2018). Scaling up good ideas is really, really hard — and we’re starting to figure out why. Retrieved January 23, 2021, from

Simpson, A. (2018). Princesses are bigger than elephants: Effect size as a category error in evidence-based education. British Educational Research Journal, 44(5), 897–913.

References

Cheung, A. C. K., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

Dimova, S., Ilie, S., Brown, E. R., Broeks, M., Culora, A., & Sutherland, A. (2020). The Nuffield Early Language Intervention. London. Retrieved from

EEF. (2017). EEF standards for independent evaluation panel members. Retrieved from

Graham, S., McKeown, D., Kiuhara, S., & Harris, K. R. (2012). A meta-analysis of writing instruction for students in the elementary grades. Journal of Educational Psychology, 104(4), 879–896.

Hattie, J. (2008). Visible learning: a synthesis of over 800 meta-analyses relating to achievement. Abingdon: Routledge.

Miller, S., Davison, J., Yohanis, J., Sloan, S., Gildea, A., & Thurston, A. (2016). Texting parents: evaluation report and executive summary. London. Retrieved from

See, B. H., & Gorard, S. (2013). What do rigorous evaluations tell us about the most promising parental involvement interventions? A critical review of what works for disadvantaged children in different age groups. London: Nuffield Foundation. Retrieved from

Slavin, R. E., & Madden, N. A. (2011). Measures Inherent to Treatments in Program Effectiveness Reviews. Journal of Research on Educational Effectiveness, 4(4), 370–380.

Speckesser, S., Runge, J., Foliano, F., Bursnall, M., Hudson-Sharp, N., Rolfe, H., & Anders, J. (2018). Embedding formative assessment: evaluation report and executive summary. London. Retrieved from

Thurston, A. (2014). The Potential Impact of Undiagnosed Vision Impairment on Reading Development in the Early Years of School. International Journal of Disability, Development and Education, 61(2), 152–164.

Torgerson, D. J., Torgerson, C. J., Ainsworth, H., Buckley, H., Heaps, C., Hewitt, C., & Mitchell, N. (2014). Using self-regulation to improve writing. London. Retrieved from

Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2020). Average Effect Sizes in Developer-Commissioned and Independent Evaluations. Journal of Research on Educational Effectiveness, 13(2), 428–447.