Evidence use

Evidence and Timetables

We’re approaching that time of year when attention turns to timetables.

Deciding what to prioritise involves integrating our values with considerations of effectiveness and logistical constraints. Effectiveness, however, could mean multiple things: are we maximising attainment, minimising workload or simply ensuring no one is too unhappy? 

Writing for Tes, I’ve highlighted three insights that I think we can discern from the evidence base.

How to use evidence properly

In a recent post, I described how the term evidence-informed practice risks losing its meaning as it becomes more widespread.

Using evidence sounds sensible, but how, precisely, can it add value to our work?

Like many people, I think evidence use involves close consideration of our context, high-quality evidence and professional judgement – but this is too vague. I think evidence can add value to our work by helping us make four decisions. I consider these essentially the mechanisms that lead to evidence-informed practice.

  1. Deciding what to do
  2. Deciding what to do exactly
  3. Deciding how to do things
  4. Deciding if things work

Deciding what to do

Evidence can help us to decide where to focus our effort. The EEF’s Toolkit has hugely influenced these decisions, and the phrase ‘best bets’ is now widely used when discussing evidence.

The main currency in school is the time of teachers and pupils. Therefore, it is crucial to recognise that evidence can also identify some activities that, on average, had a relatively low impact in the past.

Although this is where people often start with evidence, I think that this is one of the most challenging ways to add value with evidence. The EEF’s implementation process, particularly the explore stage, is very helpful here, but I think the expertise to use it well is spread thinly.

Deciding what to do exactly

‘It ain’t what you do; it’s the way that you do it’ is another phrase famous amongst people interested in evidence use. I’m not sure I fully appreciated what this meant for a long time. But the popularisation of approaches like retrieval practice has taught me that quality is everything.

School leaders need to define quality. This is necessary to move from a vision to a shared vision, to shared practice. If this is not done well, common issues include superficial compliance, the drift of ideas over time, and difficulties with monitoring and evaluation.

Interestingly, different forms of evidence can help define quality, including observation and reflection, which underpin approaches like Teach Like a Champion and Walkthrus.

Randomised controlled trials alone are rubbish at building theories – though they are still needed to test them. Resources like the EEF’s Toolkit and Guidance Reports helpfully identify areas where teachers should focus their efforts. Still, the insights are rarely granular enough to decide what to do exactly – we need professional judgement.

Deciding how to do things

A striking finding from the work of the EEF and other organisations is that how things are done is just as important as what is done. Equally striking, when working with many schools, is that some can take an idea that does not seem very promising and make it work because they excel at implementation. The reverse is also true.

A common challenge when trying to do things in school is that we have skipped over the first stage: we have not defined – with precision and depth – what quality looks like. This creates all sorts of problems, not least that it becomes impossible to ‘faithfully adopt’ and ‘intelligently adapt’ an approach.

Consistency is then sometimes pursued as a goal for its own sake. I think this is dangerous without a clear conception of quality and how consistency can add value. It shares some troubling characteristics with fanaticism: when someone redoubles their efforts after they have forgotten their aims.

Deciding if things work

In my previous post, I argued that evaluation is tremendously challenging in schools because the signal-to-noise ratio is so poor: most things we do in school have a small to modest impact, yet many other factors influence the outcomes we care about. Therefore, it is exceptionally hard to isolate the effects of specific actions.
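
The signal-to-noise problem can be made concrete with a rough sketch. It assumes an illustrative true effect of 0.1 standard deviations and a simple comparison of two equally sized groups of pupils; both are assumptions chosen for illustration, not figures from the post or from any study.

```python
import math

ASSUMED_EFFECT = 0.1  # hypothetical true effect, in standard deviation units


def standard_error(n_per_group: int, sd: float = 1.0) -> float:
    """Standard error of a difference in means between two equal-sized groups."""
    return sd * math.sqrt(2 / n_per_group)


def signal_to_noise(effect: float, n_per_group: int) -> float:
    """True effect divided by the standard error of its estimate."""
    return effect / standard_error(n_per_group)


# Compare a single class, a large year group and a multi-school sample.
for n in (30, 200, 1000):
    ratio = signal_to_noise(ASSUMED_EFFECT, n)
    print(f"{n} pupils per group: signal-to-noise = {ratio:.2f}")
```

Under these assumptions, a comparison of two classes of 30 leaves the noise more than twice as large as the effect, and it takes around 200 pupils per group before the two are even equal.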

Ultimately, if we rely only on best bets, we are gambling. The best way to protect ourselves from a net negative impact of any policy is to find out if it is working as we hope in our schools with our pupils.

So what?

I think focusing more on how evidence can add value to our work in schools is essential. Crucially, these different mechanisms of evidence use, or decisions, require different forms of evidence and tools. They also make different assumptions.

If we are serious about using evidence properly in schools, we need to get a lot more interested in the detail of how evidence adds value.

How comparable are Ofsted grades?

What should Ofsted do, and how should they do it?

It’s a perennial question in education. An answer that will often come up in these conversations is that Ofsted gives vital information to parents about the quality of schools.

I am sceptical that Ofsted currently does this very well for two reasons:

  1. It’s tricky comparing grades between years – particularly as the nature of inspections shifts even within a year
  2. It’s not possible to compare inspections between different frameworks

How big is the issue?

Armed with my scepticism, I explored these issues by comparing secondary schools within every parliamentary constituency. I chose constituencies as the unit of analysis since they are a reasonable approximation of the set of schools a family chooses between. Constituencies have a median of 6 secondary schools (interquartile range: 5-7; range: 2-15).

I turned my two concerns into three binary questions that could be answered for each constituency:

  1. Are there the same grades but from different years?
  2. Are there the same grades but using different frameworks?
  3. Is there an Outstanding on the old framework and a Good on the new framework?

I found that the first barrier affects nine out of 10 constituencies. Two-thirds of constituencies are affected by the second barrier; one-third are affected by the final barrier.
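
The three binary questions could be answered for a single constituency along the following lines. This is a minimal sketch, not the analysis actually used: the `Inspection` record, the "old"/"new" framework labels and the example schools are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Inspection:
    school: str
    year: int
    framework: str  # "old" or "new" (labels assumed for illustration)
    grade: str      # e.g. "Outstanding", "Good"


def barriers(inspections: list[Inspection]) -> dict[str, bool]:
    """Answer the three binary questions for one constituency."""
    pairs = list(combinations(inspections, 2))
    has_old_outstanding = any(
        i.framework == "old" and i.grade == "Outstanding" for i in inspections
    )
    has_new_good = any(
        i.framework == "new" and i.grade == "Good" for i in inspections
    )
    return {
        # 1. Same grade awarded in different years
        "same_grade_different_year": any(
            a.grade == b.grade and a.year != b.year for a, b in pairs
        ),
        # 2. Same grade awarded under different frameworks
        "same_grade_different_framework": any(
            a.grade == b.grade and a.framework != b.framework for a, b in pairs
        ),
        # 3. An old-framework Outstanding alongside a new-framework Good
        "old_outstanding_and_new_good": has_old_outstanding and has_new_good,
    }


# Hypothetical constituency with three schools.
example = [
    Inspection("School A", 2014, "old", "Outstanding"),
    Inspection("School B", 2022, "new", "Good"),
    Inspection("School C", 2019, "old", "Good"),
]
print(barriers(example))
```

Running the function over every constituency and averaging each flag would give the kind of proportions reported above.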

Some examples

Let’s look at some examples. Bermondsey and Old Southwark has nine secondary schools – which one is the best? One of the four Outstanding schools, right? Except only half of the previously exempt Outstanding schools have retained their grades so far.

The only inference I would feel confident with is that it looks like any of the choices will be quite a good one – which is fortunate – but it’s hard to argue that Ofsted is doing an excellent job for the families of Bermondsey and Old Southwark.

Table: Bermondsey and Old Southwark secondary schools (columns: School, Year of inspection, Framework, Grade)

Let’s look at Beverley and Holderness. It’s quite the contrast: the schools have been inspected using the same framework and within a year of each other, except for School F, which has no grade. This looks like good work by Ofsted: clear, actionable information.

Table: Beverley and Holderness secondary schools (columns: School, Year of inspection, Framework, Grade)

So what?

Ofsted’s role looks like it might be reformed in the next few years, as signalled by the recent White Paper’s promised review of regulations, and the new HMCI arriving next year will have their own views. Geoff Barton has pre-empted this debate with some interesting thoughts on removing grades.

I’ve previously criticised Ofsted for not having a published theory of change articulating how their work achieves their desired goals while mitigating the apparent risks. Ofsted do acknowledge their responsibility to mitigate these risks in their recently published strategy.

If Ofsted had a clear theory of change, then informing the parental choice of schools would very likely be part of it. The information presented here suggests that Ofsted are not currently doing a great job of this. In some ways, these issues are inevitable given that around 800 secondaries have an inspection on the new framework, 1,800 have one on the old framework, and 600 do not have a published grade.

However, if Ofsted conducted focused, area-based inspections, they would effectively mitigate the issues of different years and different frameworks for more families. These inspections would involve inspecting all the schools within an area within a similar timeframe. This would enable more like-with-like comparisons, as is currently the case in Beverley and Holderness. There would always be some boundary cases, but it would make it more explicit for families that they are not comparing like-with-like.

It is still possible to combine this approach with targeted individual inspections of schools based on perceived risks. No doubt this approach would come with some downsides. But if we are serious about giving parents meaningful information to compare schools, why do we inspect schools the way that we do?

A bonus of this approach is that you could evaluate the impact of inspections using a stepped-wedge design – a fancy RCT – where the order that areas are inspected in is randomised.
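
A stepped-wedge rollout needs little more than a reproducible random ordering of areas. The sketch below is one simple way to produce it; the function name, the area labels and the choice of three steps are hypothetical, and a real trial would need more care (for example, stratifying areas by size).

```python
import random


def stepped_wedge_schedule(areas: list[str], n_steps: int, seed: int) -> dict[str, int]:
    """Randomly assign each area to an inspection step (1..n_steps).

    Areas are shuffled with a fixed seed so the schedule can be reproduced,
    then dealt out in turn so that steps are of near-equal size.
    """
    rng = random.Random(seed)
    order = list(areas)  # copy so the caller's list is not modified
    rng.shuffle(order)
    return {area: position % n_steps + 1 for position, area in enumerate(order)}


# Hypothetical areas, inspected in three waves.
areas = ["Area A", "Area B", "Area C", "Area D", "Area E", "Area F"]
schedule = stepped_wedge_schedule(areas, n_steps=3, seed=2022)
for area, step in sorted(schedule.items(), key=lambda item: item[1]):
    print(f"Step {step}: {area}")
```

Because every area is eventually inspected and only the timing is random, outcomes in areas not yet inspected can act as the comparison at each step.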


Do Ofsted grades actually influence choices? Yes. Ofsted themselves are always keen to highlight this. There is also a clear association between grades and school popularity, which we can approximate by calculating the ratio of the number on roll to school capacity. A higher ratio means that the school is more popular. The trend is clear across primary and secondary.

Ofsted grade          Primary  Secondary
Requires improvement  0.85     0.81
Serious weaknesses    0.79     0.86
Special measures      0.79     0.74
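
The popularity measure described above is the number on roll divided by capacity, summarised by grade. The sketch below assumes the summary is the mean of each school's ratio, which the post does not confirm, and the school figures are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_ratio_by_grade(schools: list[tuple[str, int, int]]) -> dict[str, float]:
    """Average the roll-to-capacity ratio within each Ofsted grade.

    Each school is a (grade, number_on_roll, capacity) tuple.
    """
    ratios = defaultdict(list)
    for grade, number_on_roll, capacity in schools:
        ratios[grade].append(number_on_roll / capacity)
    return {grade: mean(values) for grade, values in ratios.items()}


# Hypothetical schools: (grade, number on roll, capacity).
example_schools = [
    ("Good", 900, 1000),
    ("Good", 1000, 1000),
    ("Special measures", 370, 500),
]
for grade, ratio in mean_ratio_by_grade(example_schools).items():
    print(f"{grade}: {ratio:.2f}")
```

A ratio above 1 would indicate a school with more pupils on roll than its stated capacity.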

The apple pie stage of evidence-informed practice

The unstoppable growth of evidence

In 1999, a young Dr Robert Coe published an excitingly titled Manifesto for Evidence-Based Education. You can still read it; it’s very good. The updated version from 2019 is even better.

With deep foresight, Rob argued that:

“Evidence-based” is the latest buzz-word in education. Before long, everything fashionable, desirable and Good will be “evidence-based”. We will have Evidence-Based Policy and Evidence-Based Teaching, Evidence-Based Training – who knows, maybe even Evidence-Based Inspection… evidence, like motherhood and apple pie, is in danger of being all things to all people.

Professor Rob Coe

It’s safe to say that we have now reached the apple pie stage of evidence in education: the recent White Paper promises evidence-based policy and Ofsted’s new strategy offers evidence-based inspection.

If you’re reading this, then, like me, you’re among the converted when it comes to the promise of evidence, but my faith is increasingly being tested. For a long time, I thought any move towards more evidence use was unquestionably good, but I now wonder if more evidence use is always a good thing.

To see why, look at this sketch of what I think are plausible benefit-harm ratios for five different levels of evidence use.

Level 1: no use

This is what happens when teachers use their intuition and so on to make decisions. This provides a baseline against which we can judge the other levels of evidence use.

Level 2: superficial use

I think in some ways, this is what Rob foresaw. For me, this stage is characterised by finding evidence to justify, rather than inform, decisions. The classic example of this is finding evidence to justify Pupil Premium spending long after the decision has been made.

I think this is a fairly harmless activity, and the only real downside is that it wastes time, which is why there is a slightly increased risk of an overall harm. Equally, I think it’s plausible that superficial use might focus our attention on more promising things, which could also increase the likelihood of a net benefit.

Level 3: emerging use

For me, this is where it starts to get risky. I also dare say that if you’re reading this, there’s a good chance that you fall into this category – at least for some things you do. So why do I think there’s such a risk of a net harm? Here are three reasons:

  1. Engaging with evidence is time-consuming, so there might be more fruitful ways of spending our time.
  2. We might misinterpret the evidence and make poor decisions. There’s decent evidence that retrieval practice is beneficial, but some teachers and indeed whole schools have used this evidence to justify spending significant portions of lessons retrieving prior learning, which everything I know about teaching tells me is probably not helpful. This is an example of what some people have called a lethal mutation.
  3. There’s also a risk of overconfidence. If we think that ‘the evidence says’ we should use a particular approach, then there is a risk that we keep going at it despite signals that it’s not having the benefit we hope for.

Of course, even emerging evidence use may be immensely beneficial. I think there are three basic mechanisms by which evidence can help us to be more effective:

  1. Deciding what to do – for instance, the EEF’s tiered model guides schools towards focusing on quality teaching, targeted interventions and wider strategies with the most effort going into quality teaching.
  2. Deciding what to do exactly – what is formative assessment exactly? This is a question I routinely ask teachers and the answers are often quite vague. Evidence can help us define quality.
  3. Deciding how to do things – a key insight from the EEF’s work is that both the what and the how matter. Effective implementation and professional development can be immensely valuable.

The interplay of these different mechanisms, and other factors, will determine for any single decision whether the net impact of evidence use is beneficial or harmful.

Level 4: developing use

At this level, we’re likely spending even more time engaging with evidence. But we’re also likely reaping more rewards.

I think the potential for dramatic negative impacts starts to be mitigated by better evaluation. At level three, we were relying on ‘best bets’, but we had little idea of whether it was actually working in our setting. Although imperfect, some local evaluation protects us from larger net harms.

Level 5: sophisticated use

Wow – we have become one with research evidence. At this stage, we become increasingly effective at maximising the beneficial mechanisms outlined in level 3, but we do this with far greater quality.

Crucially, the quality of local evaluation is even better, which almost completely protects us from net harm – particularly over the medium to long term. Also, at this stage, the benefits arguably become cumulative, meaning that things get better and better over time – how marvellous!

So what?

The categories I’ve outlined are very rough and you will also notice that I have cunningly avoided offering a full definition of what I even mean by evidence-informed practice.

I think there are lots of implications of these different levels of evidence use, but I’ll save them for another day. What do you think? Is any evidence use a good thing? Am I being a needless gatekeeper about evidence use? Do these different levels have implications for how we should use evidence?

A version of this blog was originally published on the Research Schools Network site.

UCL trains as many teachers as the smallest 57 providers

Last year, I wrote two pieces about the potential of the ITT market review. I outlined two ways that the review could be successful in its own terms:

  1. Less effective providers would be removed from the market and replaced by others that are better – either new entrants to the market or existing stronger providers expanding.
  2. All providers make substantial improvements to their programme so that they achieve fidelity to the Core Content Framework and the like.

This week, we found out that 80 out of 212 applicants in the first round were successful. Schools Week described this as ‘savage’. However, a piece in Tes suggests that many providers missed out because of minor issues or technicalities, such as exceeding the word limit.

A lot of stories led with the figure of 80 successful providers and highlighted that this risks creating a situation where there are not enough providers. The graph below shows the number of teachers each provider trains – each bar is a different provider and the grey line is the cumulative percentage. An obvious takeaway is that providers vary massively in size.

One way to think about this is to look at the extremes. In 2021, there were 233 providers, and if we split these up into thirds:

  • The largest providers trained 27,000
  • The middle providers trained 5,000
  • The smallest providers trained 2,500

So instead of asking what proportion of providers got through, a more useful question might be what proportion of the capacity got through?
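
The difference between counting providers and counting capacity can be sketched as follows. The trainee totals are the rounded figures quoted above; the `capacity_share` helper and the three example providers are hypothetical.

```python
# Rounded trainee totals for each third of providers, as quoted in the post.
trainees_by_third = {"largest": 27_000, "middle": 5_000, "smallest": 2_500}

total_trainees = sum(trainees_by_third.values())
for third, trainees in trainees_by_third.items():
    print(f"{third} third: {trainees / total_trainees:.0%} of trainees")


def capacity_share(sizes: dict[str, int], approved: set[str]) -> float:
    """Proportion of all training places held by the approved providers."""
    approved_places = sum(size for name, size in sizes.items() if name in approved)
    return approved_places / sum(sizes.values())


# Hypothetical market: one of three providers is approved.
sizes = {"Provider 1": 800, "Provider 2": 150, "Provider 3": 50}
share = capacity_share(sizes, approved={"Provider 1"})
print(f"1 of 3 providers approved, covering {share:.0%} of places")
```

On the rounded figures, the largest third of providers accounts for roughly 78% of trainees, the middle third for 14% and the smallest third for 7%, so the proportion of providers approved can be a poor guide to the proportion of capacity approved.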

We can look at the same data with a tree map, only this time the shading highlights the type of provider. The universities are shown in light grey, while SCITTs are shown in darker grey. I’ve chosen to highlight this because while the DfE are clear that they are neutral on this matter, if you do consider size, the providers split fairly well into two camps.

So what?

This issue shows that it’s worth looking beyond the average.

I also think this dramatic variation in provider size suggests that maybe we haven’t got the process right. At the extreme end, the largest provider, UCL, trains the same number of teachers as the smallest 57 providers. I suspect that there is dramatic variation within the providers – should we try to factor this into the process?

Are we managing risks and allocating scrutiny rationally with a single approach that does not factor in the size of the organisation? Should we review different things depending on the scale at which organisations work, since new risks, opportunities and challenges arise with scale?

What else?

Averages are alluring, but headteachers will rightly care about what is going on in their area. I’m aware of a lot of anecdotal evidence and more systematic evidence that teachers tend not to move around a lot after their initial teacher training – I think this may be particularly true in the north east.

After the second round, there will be some cold spots. After all, there are already… Thinking through how to address this will be critical.

Book review: The Voltage Effect

Writing for Schools Week, I reviewed John A. List’s new book The Voltage Effect.

What connects Jamie’s Italian, an early education centre and Uber? Economist John A. List argues persuasively that the answer is scale, and that it is an issue relevant to everyone.

Scale permeates education, yet is rarely considered systematically. And this means some things fail for predictable reasons. For instance, there is really promising evidence about tuition, but the challenges of scaling this rapidly have proved considerable. At the simplest level, we scale things when we extend them; this might happen across a key stage, subject, whole school, or group of schools.

Scalability is crucial to national conversations such as the current focus on teacher development, Oak National Academy and the expansion of Nuffield Early Language Intervention – now used by two-thirds of primary schools. Sadly, List explains that things usually get worse as they scale: a voltage drop.

What’s the role of phonics in secondary school?

Writing for Tes, I argue that we need to think carefully about how we use phonics in secondary.

The emphasis on phonics in English primary schools has increased dramatically since 2010, which makes the existing evidence less relevant to pupils who didn’t respond well to phonics teaching.

Even in recent research, older, struggling readers were often receiving systematic phonics teaching for the first time, particularly in studies outside of England. At best, these findings overstate the impact that we might expect.

I think there are some specific points about phonics that are interesting here, but I also think this highlights some wider principles about evidence-informed practice.

I think of this as being akin to the half-life of research. I was first introduced to this idea years ago, based on an interpretation of the evidence about peer tutoring offered by Robbie Coleman and Peter Henderson.

Do you really do it already?

Writing for the Research Schools Network, I challenge the notion that ‘we do it already’.

As we continue delivering the Early Career Framework (ECF) programme, we continue listening to our partner schools as well as the national debate. We are using this to both refine our work with schools, and to inform our engagement with national organisations.

One theme we have identified where we think we can clarify – and indeed even challenge schools – is the perception that ‘we already do this’, ‘we already know this’, or ‘we learnt this during initial teacher training’.

As a science teacher, one issue I regularly encounter is that pupils recall doing practicals. They may have completed them at primary school, or perhaps earlier in their time at secondary school. One thing I point out is the difference between being familiar with something – like ‘we remember doing this’ – and understanding something deeply.

My impression is that a similar phenomenon occurs with the ECF, which is not surprising given the overlap in content with the Core Content Framework that underpins ITT programmes. Indeed, I would argue that a similar phenomenon occurs with most teacher development. As teachers, we are often interested in the next big thing, but as Dylan Wiliam has argued, perhaps we should instead focus on doing the last big thing properly.

One way that I have previously tried to illustrate this point is with assessment for learning. These ideas were scaled up through various government initiatives from the late 1990s onwards, such that if you taught during this time, you are likely to have had some professional development about it.

Given this, I sometimes cheekily ask teachers to define it. I might give them a blank sheet of paper and some time to discuss it. There are always some great explanations, but it is also fair to say that this normally challenges teachers. Typically, the features that teachers first highlight are the more superficial aspects, such as asking lots of questions or using a random name generator.

No doubt given longer, we could reach better descriptions, but even experienced teachers can struggle to describe the deep structures of formative assessment, which Wiliam has described as his five principles. I have tried to summarise this by describing different levels of quality shown below. We could argue about what goes into each box – indeed, I think this would be an excellent conversation – but it hopefully illustrates that it is possible to use assessment for learning with different levels of quality.

In addition to these principles, there is also a deep craft to formative assessment. Arguably, this is what separates good from great formative assessment. Great formative assessment does not involve totally different things; it is the sophistication and nuance that matter.

The need for repetition and deep engagement with ideas is not just my opinion; it is a key tenet of the EEF’s professional development guidance report. Further, the EEF evaluated a two-year programme focusing entirely on Embedding Formative Assessment, which led to improvements in GCSE outcomes, which are notoriously difficult to improve in research studies.

This project involved teachers meeting monthly to spend 90 minutes discussing key aspects of formative assessment. Unsurprisingly, some of these teachers also reported that they were familiar with the approach, yet the evidence is clear that this approach to improving teaching was effective.

Finally, there is some truth to the issues raised about repetition, and I think that ECTs and Mentors are right to protest that they have encountered some activities before. This is probably not very helpful. However, there is a big difference between just repeating activities and examining a topic in more depth. The DfE have committed to reviewing all programmes ahead of the next cohort in September, and I hope this distinction is recognised.

Level of quality, with an assessment for learning example

Level 1: superficial. Colleagues are aware of some superficial aspects of the approach, and may have some deeper knowledge, but this is not yet reflected in their practice.
Example: “Formative assessment is about copying learning objectives into books, and teachers asking lots of questions. Lollipop sticks and random name generators can be useful.”

Level 2.
Example: “Formative assessment is about regularly checking pupils’ understanding, and then using this information to inform current and future lessons. Effective questioning is a key formative assessment strategy.”

Level 3: developing. Colleagues have a clear understanding of the key dos and don’ts, which helps them to avoid common pitfalls. However, at this level, it is still possible for these ideas to be taken as a procedural matter. Further, although the approach may be increasingly effective, it is not yet as efficient as it might be. Teachers may struggle to be purposeful and use the approaches judiciously as they may not have a secure enough understanding – this may be particularly noticeable with how the approach may need tailoring to different phases, subjects, or contexts.
Example: “Formative assessment is about clearly defining and sharing learning intentions, then carefully eliciting evidence of learning and using this information to adapt teaching. Good things to do include eliciting evidence of learning in a manner that can be gathered and interpreted efficiently and effectively, and using evidence of learning to decide how much time to spend on activities and when to remove scaffolding. Things to avoid include making inferences about all pupils’ learning using a non-representative sample, such as pupils who volunteer answers, and mistaking pupils’ current level of performance for learning.”

Level 4: sophisticated. Colleagues have a secure understanding of the mechanisms that lead to improvement and how active ingredients can protect those mechanisms. This allows teachers to purposefully tailor approaches to their context without compromising fidelity to the core ideas. Further, at this level there is an increasing understanding of the conditional nature of most teaching and that there is seldom a single right way of doing things. Teaching typically involves lots of micro decisions, ‘if x then y’. There is also a growing awareness of trade-offs and diminishing returns to any activity. At this level, there is close attention to how changes in teaching lead to changes in pupil behaviours, which in turn influence learning.
Example: “I have a strong understanding of Wiliam’s five formative assessment strategies. Formative assessment allows me to become more effective and efficient in three main ways:
1. Focusing on subject matter that pupils are struggling with.
2. Optimising the level of challenge – including the level of scaffolding – and allowing teachers to move on at an appropriate pace, or to increase levels of support.
3. Developing more independent and self-aware learners who have a clearer understanding of what success looks like, which they can use to inform their actions in class as well as any homework and revision.”

Guided play: the problems with the research

Writing for Tes, I highlight some issues with a recent systematic review about the impact of guided play.

Although the review has many strengths, there are three issues that limit what we can conclude from it.

First, the underlying studies are poor, and not much is done to account for this issue.

Second, the definitions used for free play, guided play, and direct instruction are muddled, including the aggregation of business-as-usual with direct instruction. This threatens the research team’s conclusions.

Third, using just 17 studies, the team conduct 12 separate meta-analyses. On closer inspection, the way that the studies are combined is even more questionable.

Social and emotional learning: a methodological hot take

One of my earliest encounters with social and emotional learning as a teacher came in the early 2010s when I removed a faded poster from the mouldy corner of my new classroom.

I was reminded of this experience when Stuart Locke, chief executive of a trust, tweeted his shock that the Education Endowment Foundation advocated social and emotional learning (EEF, 2019b). Stuart based his argument on his own experiences as a school leader during the 2000s and a critical review of some underlying theories (Craig, 2007).

Given this, I decided to look at the evidence for social and emotional learning (SEL), unsure of what I would find.

Fantasy evidence

When thinking about how strong the evidence is for a given issue, I find it helpful first to imagine what evidence would answer our questions. I have two broad questions about SEL:

  1. Is investing in SEL cost-effective compared to alternatives?
  2. What are the best ways of improving SEL?

We would ideally have multiple recent studies comparing different SEL programmes to answer these questions. These studies would be conducted to the highest standards, like the EEF’s evaluation standards (EEF, 2017, 2018). Ideally, the array of programmes compared would include currently popular programmes and those with a promising theoretical basis. These programmes would also vary in intensity to inform decisions about dosage.

Crucially, the research would look at a broad array of outcomes, including potential negative side-effects (Zhao, 2017). Such effects matter because there is an opportunity cost to any programme. These evaluations would not only look at the immediate impact but would track important outcomes through school and, even better, into later life. This is important given the bold claims made for SEL programmes and the plausible argument that it takes some time for the impact to feed through into academic outcomes.

The evaluations would not be limited to comparing different SEL programmes. We would even have studies comparing the most promising SEL programmes to other promising programmes such as one-to-one tuition to understand the relative cost-effectiveness of the programmes. Finally, the evaluations would provide insights into the factors influencing programme implementation (Humphrey et al., 2016b, 2016a).

Any researcher reading this may smile at my naïve optimism. Spoiler: the available evidence does not come close to this. No area of education has evidence like this. Therefore, we must make sense of incomplete evidence.

A history lesson

Before we look at the available evidence for SEL, I want to briefly trace its history based on my somewhat rapid reading of various research and policy documents.

A widely used definition of SEL is that it refers to the process through which children learn to understand and manage emotions, set and achieve positive goals, feel and show empathy for others, establish and maintain positive relationships, and make responsible decisions (EEF, 2019b).

CASEL, a US-based SEL advocacy organisation, identify five core competencies: self-awareness, self-management, social awareness, relationship skills, and responsible decision-making (CASEL, 2022). A challenge with the definition of SEL is that it is slippery. This can lead to what psychologists call the jingle-jangle fallacy. The jingle fallacy occurs when we assume that two things are the same because they have the same names; the jangle fallacy occurs when two almost identical things are taken to be different because they have different names.

Interest in social and emotional learning has a long history, both in academic research and in the working lives of teachers who recognise that their responsibilities extend beyond ensuring that every pupil learns to read and write. In England, the last significant investment in social and emotional learning happened in the late 2000s and was led by Jean Gross CBE (DfE, 2007). By 2010, around 90% of primary schools and 70% of secondary schools used the approach (Humphrey et al., 2010). The programme was called the social and emotional aspects of learning (SEAL) and focused on five dimensions different from those identified by CASEL but with significant overlap.

In 2010, the DfE published an evaluation of the SEAL programme (Humphrey et al., 2010). Unfortunately, the evaluation design was not suitable to make strong claims about the programme’s impact. Before this evaluation, there were five other evaluations of the SEAL programme, including one by Ofsted (2007), which helped to pilot the approach.

In 2010, the coalition government came to power, and the national strategies stopped. Nonetheless, interest in social and emotional learning arguably persists: a 2019 survey of primary school leaders found that it remained a very high priority for them, although there were reasonable concerns about the representativeness of the respondents (Wigelsworth, Eccles, et al., 2020).

In the past decade, organisations interested in evidence-based policy have published reports concerning social and emotional learning. Here are twelve.

  • In 2011, an overarching review of the national strategies was published (DfE, 2011).
  • In 2012, NICE published guidance on social and emotional wellbeing in the early years (NICE, 2012).
  • In 2013, the EEF and Cabinet Office published a report on the impact of non-cognitive skills on the outcomes for young people (Gutman & Schoon, 2013).
  • In 2015, the Social Mobility Commission, Cabinet Office, and Early Intervention Foundation published a series of reports concerning the long-term effects of SEL on adult life, evidence about programmes, and policymakers’ perspectives (EIF, 2015).
  • In 2015, the OECD published a report on the power of social and emotional skills (OECD, 2015).
  • In 2017, the US-based Aspen Institute published a scientific consensus statement concerning SEL (Jones & Kahn, 2017).
  • In 2018, the DfE began publishing findings from the international early learning and child wellbeing (IELS) study in England, including SEL measures (DfE, 2018).
  • In 2019, the EEF published a guidance report setting out key recommendations for improving social and emotional learning (EEF, 2019b).
  • In 2020, the EEF published the results of a school survey and an evidence review that supported the 2019 guidance report (Wigelsworth, Eccles, et al., 2020; Wigelsworth, Verity, et al., 2020).
  • In 2021, the Early Intervention Foundation published a systematic review concerning adolescent mental health, including sections on SEL (Clarke et al., 2021).
  • In 2021, the EEF updated its Teaching and Learning Toolkit, which includes a strand on social and emotional learning (EEF, 2021).
  • In 2021, the Education Policy Institute published an evidence review of SEL and recommended more investment, particularly given the pandemic (Gedikoglu, 2021).

The evidence

To make sense of this array of evidence, we need to group it. There are many ways to do this, but I want to focus on three: theory, associations, and experiments.


Theory is perhaps the most complicated. To save my own embarrassment, I will simply point out that social and emotional learning programmes have diverse theoretical underpinnings, and these have varying levels of evidential support. Some are – to use a technical term – a bit wacky, while others are more compelling. A helpful review of some of the theory, particularly comparing different programmes, comes from an EEF-commissioned review (Wigelsworth, Verity, et al., 2020). I also recommend this more polemical piece (Craig, 2007).


The next group of studies are those that look for associations or correlations. These studies come in many different flavours, including cohort studies that follow a group of people throughout their lives like the Millennium Cohort Study (EIF, 2015). The studies are united in that they look for patterns between SEL and other outcomes. Still, they share a common limitation: it is hard to identify what causes what. These studies can highlight areas for further investigation, but we should not attach too much weight to them. Obligatory XKCD reference.


Experiments can test causal claims by estimating what would have happened without the intervention and comparing this to what we observe. Experiments are fundamental to science, as many things seem promising when we look at just theory and associations, but when investigated through rigorous experiments are found not to work (Goldacre, 2015).

There are four recent meta-analyses that include experiments (Mahoney et al., 2018). These meta-analyses have strongly influenced the findings of most of the reports listed above. The strength of meta-analysis, when based on a systematic review, is that it reduces the risk of bias from cherry-picking the evidence (Torgerson et al., 2017). It also allows us to combine lots of small studies, which may individually be too small to detect important effects. Plus, high-quality meta-analysis can help make sense of the variation between studies by identifying factors associated with these differences. To be clear, these are just associations, so they need to be interpreted very cautiously, but they can provide important insights for future research and practitioners interested in best bets.
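
To make the point about combining small studies concrete, here is a minimal sketch of inverse-variance pooling. It assumes a simple fixed-effect model, which may not be the model used in the meta-analyses cited above, and the effect sizes and standard errors are hypothetical numbers chosen purely for illustration.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Return the inverse-variance pooled effect and its standard error.

    Assumes a fixed-effect model: every study estimates the same
    underlying effect, and each is weighted by 1 / SE^2.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


def ci95(estimate, se):
    """Approximate 95% confidence interval using the normal distribution."""
    return estimate - 1.96 * se, estimate + 1.96 * se


# Hypothetical standardised effects from four small studies (not real data).
effects = [0.10, 0.25, 0.28, 0.17]
std_errors = [0.15, 0.15, 0.15, 0.15]

for d, se in zip(effects, std_errors):
    low, high = ci95(d, se)
    print(f"study: d = {d:.2f}, 95% CI = ({low:.2f}, {high:.2f})")

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
low, high = ci95(pooled, pooled_se)
print(f"pooled: d = {pooled:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

In this made-up example, every individual interval includes zero, while the pooled interval does not. Pooling sharpens precision, but it does nothing to fix weaknesses in the underlying studies.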

Unfortunately, the meta-analyses include some pretty rubbish studies. This is a problem because the claims from some of these studies may be wrong. False. Incorrect. Mistaken. Researchers disagree on the best way of dealing with studies of varying quality. At the risk of gross oversimplification, some let almost anything in (Hattie, 2008), others apply stringent criteria and end up with few studies to review (Slavin, 1986), while others set minimum standards, but then try to take account of research quality within the analysis (Higgins, 2016).

If you looked at the twelve reports highlighted above and the rosy picture they paint, you would be forgiven for thinking that there must be a lot of evidence concerning SEL. Indeed, there is quite a lot of evidence, but the problem is that it is not all very good. Take one of the most widely cited programmes, PATHS, for which a recent focused review by the What Works Clearinghouse (think US-based EEF) found 35 studies of which:

  • 22 were ineligible for review
  • 11 did not meet their quality standards
  • 2 met the standards without reservations

Using the two studies that did meet the standards, the reviewers concluded that PATHS had no discernible effects on academic achievement, student social interaction, observed individual behaviour, or student emotional status (WWC, 2021). 

Unpacking the Toolkit

To get into the detail, I have looked closely at just the nine studies included in the EEF’s Toolkit strand on SEL with primary aged children since 2010 (EEF, 2021). The date range is arbitrary, but I have picked the most recent studies because they are likely the best and most relevant – the Toolkit also contains studies from before 2010 and studies with older pupils. I chose primary because the EEF’s guidance report focuses on primary too. Note that sampling studies from the Toolkit like this reduces the risk of cherry-picking, since the Toolkit itself is based on systematic searches. The forest plot below summarises the effects from the included studies. The evidence looks broadly positive because most of the boxes are to the right of the red line. Note that two studies reported multiple effects, hence there are 11 effects but only nine studies for review.

It is always tempting to begin to make sense of studies by looking at the impact, as we just did. But I hope to convince you we should start by looking at the methods. The EEF communicates the security of a finding through padlocks on a scale from zero to five, with five padlocks being the most secure (EEF, 2019a). Of the nine studies, two are EEF-funded studies, but for the remaining seven, I have estimated the padlocks using the EEF’s criteria.

Except for the two EEF-funded studies, the studies got either zero or one padlock. The Manchester (2015) study received the highest security rating and is a very good study: we can have high confidence in the conclusion. The Sloan (2018) study got just two padlocks but is quite compelling, all things considered. Despite being a fairly weak study by the EEF’s standards, it is still far better than the other studies.  

The limitations of the remaining studies are diverse, but recurring themes include:

  • High attrition – when lots of participants are randomised but then not included in the final analysis, this effectively ruins the point of randomisation (IES, 2017a).
  • Few cases randomised – multiple studies only randomised a few classrooms, and the number of cases randomised has a big impact on the security of a finding (Gorard, 2013).
  • Poor randomisation – the protocols for randomisation are often not specified, and it is not always possible to assess the integrity of the randomisation process (IES, 2017b).
  • Self-reported outcomes – most studies used self-reported outcomes from pupils or teachers, which are associated with inflated effect sizes (Cheung & Slavin, 2016). The EEF’s studies have also consistently shown that teacher perceptions of impact are poor predictors of the findings from evaluations (Stringer, 2019).
  • Unusual or complex analysis choices – many studies include unusual analysis choices that are not well justified, like dichotomising outcome variables (Altman & Royston, 2006). Further, the analyses are often complex, and without pre-specification, this gives lots of ‘researcher degrees of freedom’ (Simmons et al., 2011).
  • Incomplete reporting – reports are often vague about essential details, which makes it difficult to properly assess the security of the findings or to get a clear understanding of the exact nature of the intervention (Hoffmann et al., 2014; Montgomery et al., 2018).
  • Social threats to validity – where classes within a school are allocated to different conditions, there is a risk of social threats to validity, like resentful demoralisation, which were not guarded against or monitored (Shadish et al., 2002).
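
Two of these themes, attrition and the number of cases randomised, lend themselves to simple arithmetic. The sketch below is illustrative only: the trial numbers and the intracluster correlation (ICC) are hypothetical, the function names are my own, and the design effect formula assumes equal-sized clusters. It computes overall and differential attrition, and the effective sample size when whole classes rather than individual pupils are randomised.

```python
def attrition(randomised, analysed):
    """Proportion of randomised participants missing from the analysis."""
    return 1.0 - analysed / randomised


def overall_and_differential(rand_t, analysed_t, rand_c, analysed_c):
    """Overall attrition and the absolute gap in attrition between arms."""
    overall = attrition(rand_t + rand_c, analysed_t + analysed_c)
    differential = abs(
        attrition(rand_t, analysed_t) - attrition(rand_c, analysed_c)
    )
    return overall, differential


def effective_sample_size(n_pupils, cluster_size, icc):
    """Effective sample size under cluster randomisation.

    Uses the standard design effect, 1 + (m - 1) * ICC, where m is the
    (assumed equal) number of pupils per cluster.
    """
    design_effect = 1.0 + (cluster_size - 1) * icc
    return n_pupils / design_effect


# Hypothetical trial: 120 pupils randomised per arm; 90 analysed in the
# intervention arm and 108 in the control arm.
overall, differential = overall_and_differential(120, 90, 120, 108)
print(f"overall attrition: {overall:.1%}, differential: {differential:.1%}")

# Hypothetical design: 240 pupils in 8 classes of 30, with an assumed ICC of 0.15.
ess = effective_sample_size(n_pupils=240, cluster_size=30, icc=0.15)
print(f"effective sample size: {ess:.0f} of 240 pupils")
```

In this made-up case, the 240 pupils carry roughly the information of 45 independently randomised pupils, which is why randomising only a handful of classrooms weakens a study so much.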

The SEL guidance report

Stuart’s focus was originally drawn to the Improving Social and Emotional Learning in Primary Schools guidance report (EEF, 2019b). A plank of the evidence base for this guidance report was the EEF’s Teaching and Learning Toolkit. At the time, the Toolkit rated the strand as having moderate impact for moderate cost, based on extensive evidence (EEF, 2019b). Since the major relaunch of the Toolkit in 2021, the estimated cost and impact for the SEL strand have remained the same, but the security was reduced to ‘very limited evidence’ (EEF, 2021). The relaunch involved looking inside the separate meta-analyses that made up the earlier Toolkit and getting a better handle on the individual studies (TES, 2021). In the case of the SEL strand, it appears to have highlighted the relative weakness of the underlying studies.

Being evidence-informed is not about always being right. It is about making the best possible decisions with the available evidence. And as the evidence changes, we change our minds. For what it is worth, my view is that given the strong interest among teachers in social and emotional learning, it is right for organisations like the EEF to help schools make sense of the evidence – even when that evidence is relatively thin.

This rapid deep dive into the research about SEL has also reminded me that, from time to time, it is necessary to go back to original sources rather than relying only on summaries. For instance, the EEF’s recent cognitive science review found just four studies focusing on retrieval practice that received an overall rating of high, which I know surprises many people given the current interest in using it (Perry et al., 2021).

Final thoughts

I’ll give the final word to medical statistician Professor Doug Altman: we need less research, better research, and research done for the right reasons (Altman, 1994).


Altman, D. G. (1994). The scandal of poor medical research. BMJ, 308(6924), 283–284.

Altman, D. G., & Royston, P. (2006). Statistics Notes: The cost of dichotomising continuous variables. BMJ: British Medical Journal, 332(7549), 1080.

Ashdown, D. M., & Bernard, M. E. (2012). Can Explicit Instruction in Social and Emotional Learning Skills Benefit the Social-Emotional Development, Well-being, and Academic Achievement of Young Children? Early Childhood Education Journal, 39(6), 397–405.

Bavarian, N., Lewis, K. M., Dubois, D. L., Acock, A., Vuchinich, S., Silverthorn, N., Snyder, F. J., Day, J., Ji, P., & Flay, B. R. (2013). Using social-emotional and character development to improve academic outcomes: a matched-pair, cluster-randomized controlled trial in low-income, urban schools. The Journal of School Health, 83(11), 771–779.

Brackett, M. A., Rivers, S. E., Reyes, M. R., & Salovey, P. (2012). Enhancing academic performance and social and emotional competence with the RULER feeling words curriculum. Learning and Individual Differences, 22(2), 218–224.

CASEL. (2022). Advancing Social and Emotional Learning.

Cheung, A. C. K., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

Clarke, A., Sorgenfrei, M., Mulcahy, J., Davie, P., Friedrick, C., & McBride, T. (2021). Adolescent mental health: A systematic review on the effectiveness of school-based interventions. Early Intervention Foundation.

Craig, C. (2007). The potential dangers of a systematic, explicit approach to teaching social and emotional skills (SEAL) An overview and summary of the arguments.

DfE. (2007). Social and emotional aspects of learning for secondary schools (SEAL).

DfE. (2011). The national strategies 1997 to 2011.

DfE. (2018). International early learning and child wellbeing: findings from the international early learning and child wellbeing study (IELS) in England.

EEF. (2017). EEF standards for independent evaluation panel members.

EEF. (2018). Statistical analysis guidance for EEF evaluations.

EEF. (2019a). Classification of the security of findings from EEF evaluations.

EEF. (2019b). Improving Social and Emotional Learning in Primary Schools.

EEF. (2021). Teaching and learning toolkit: social and emotional learning.

EIF. (2015). Social and emotional learning: skills for life and work.

Gedikoglu, M. (2021). Social and emotional learning: An evidence review and synthesis of key issues. Education Policy Institute.

Goldacre, B. (2015). Commentary: randomized trials of controversial social interventions: slow progress in 50 years. International Journal of Epidemiology, 44(1), 19–22.

Gorard, S. (2013). Research design: creating robust approaches for the social sciences (1st ed.). SAGE.

Gutman, L. M., & Schoon, I. (2013). The impact of non-cognitive skills on outcomes for young people Literature review.

Hattie, J. (2008). Visible learning: a synthesis of over 800 meta-analyses relating to achievement. Routledge.

Higgins, S. (2016). Meta-synthesis and comparative meta-analysis of education research findings: some risks and benefits. Review of Education, 4(1), 31–53.

Hoffmann, T. C., Glasziou, P. P., Boutron, I., Milne, R., Perera, R., Moher, D., Altman, D. G., Barbour, V., Macdonald, H., Johnston, M., Lamb, S. E., Dixon-Woods, M., McCulloch, P., Wyatt, J. C., Chan, A.-W., & Michie, S. (2014). Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ (Clinical Research Ed.), 348, g1687.

Humphrey, N., Lendrum, A., Ashworth, E., Frearson, K., Buck, R., & Kerr, K. (2016a). Implementation and process evaluation (IPE) for interventions in education settings: A synthesis of the literature.

Humphrey, N., Lendrum, A., Ashworth, E., Frearson, K., Buck, R., & Kerr, K. (2016b). Implementation and process evaluation (IPE) for interventions in education settings: An introductory handbook.

Humphrey, N., Lendrum, A., & Wigelsworth, M. (2010). Social and emotional aspects of learning (SEAL) programme in secondary schools: national evaluation.

IES. (2017a). Attrition standard.

IES. (2017b). What Works Clearinghouse™ Standards Handbook (Version 4.0).

Jones, S. M., Brown, J. L., Hoglund, W. L. G., & Aber, J. L. (2010). A School-Randomized Clinical Trial of an Integrated Social-Emotional Learning and Literacy Intervention: Impacts After 1 School Year. Journal of Consulting and Clinical Psychology, 78(6), 829–842.

Jones, S. M., & Kahn, J. (2017). The Evidence Base for How We Learn Supporting Students’ Social, Emotional, and Academic Development Consensus Statements of Evidence From the Council of Distinguished Scientists National Commission on Social, Emotional, and Academic Development.

Mahoney, J. L., Durlak, J. A., & Weissberg, R. P. (2018). An update on social and emotional learning outcome research –

Manchester. (2015). Promoting Alternative Thinking Strategies. EEF.

Montgomery, P., Grant, S., Mayo-Wilson, E., Macdonald, G., Michie, S., Hopewell, S., Moher, D., & CONSORT-SPI Group. (2018). Reporting randomised trials of social and psychological interventions: the CONSORT-SPI 2018 Extension. Trials, 19(1), 407.

Morris, P., Millenky, M., Raver, C. C., & Jones, S. M. (2013). Does a Preschool Social and Emotional Learning Intervention Pay Off for Classroom Instruction and Children’s Behavior and Academic Skills? Evidence From the Foundations of Learning Project. Early Education and Development, 24(7), 1020.

NICE. (2012). Social and emotional wellbeing: early years. NICE.

OECD. (2015). Skills Studies Skills for Social Progress: the power of social and emotional skills.

Ofsted. (2007). Developing social, emotional and behavioural skills in secondary schools.

Perry, T., Lea, R., Jorgenson, C. R., Cordingley, P., Shapiro, K., & Youdell, D. (2021). Cognitive science approaches in the classroom: evidence and practice review.

Schonfeld, D. J., Adams, R. E., Fredstrom, B. K., Weissberg, R. P., Gilman, R., Voyce, C., Tomlin, R., & Speese-Linehan, D. (2014). Cluster-randomized trial demonstrating impact on academic achievement of elementary social-emotional learning. School Psychology Quarterly: The Official Journal of the Division of School Psychology, American Psychological Association, 30(3), 406–420.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Slavin, R. E. (1986). Best-Evidence Synthesis: An Alternative to Meta-Analytic and Traditional Reviews. Educational Researcher, 15(9), 5–11.

Sloan, S., Gildea, A., Miller, S., & Thurston, A. (2018). Zippy’s Friends.

Snyder, F., Flay, B., Vuchinich, S., Acock, A., Washburn, I., Beets, M., & Li, K. K. (2010). Impact of a social-emotional and character development program on school-level indicators of academic achievement, absenteeism, and disciplinary outcomes: A matched-pair, cluster randomized, controlled trial. Journal of Research on Educational Effectiveness, 3(1), 26.

Stringer, E. (2019). Teacher training – the challenge of change.

TES. (2021). Toolkit puts “best bets” at teachers’ fingertips. TES.

Torgerson, C., Hall, J., & Lewis-Light, K. (2017). Systematic reviews. In R. Coe, M. Waring, L. Hedges, & J. Arthur (Eds.), Research methods and methodologies in education (2nd ed., pp. 166–179). SAGE.

Wigelsworth, M., Eccles, A., Mason, C., Verity, L., Troncoso, P., Qualter, P., & Humphrey, N. (2020). Programmes to Practices: Results from a Social & Emotional School Survey.

Wigelsworth, M., Verity, L., Mason, C., Humphrey, N., Qualter, P., & Troncoso, P. (2020). Programmes to practices: identifying effective, evidence-based social and emotional learning strategies for teachers and schools: evidence review.

WWC. (2021). Promoting Alternative THinking Strategies (PATHS).

Zhao, Y. (2017). What works may hurt: Side effects in education. Journal of Educational Change, 18(1), 1–19.