What’s the point of Ofsted?

In an exclusive for the Sunday Times, Ofsted’s Amanda Spielman revealed that she anticipates the number of Outstanding schools to be roughly halved.

Many of these Outstanding schools have not been inspected for a long time because former Secretary of State for Education Michael Gove introduced an exemption. In effect, this created a one-way road to Outstanding for schools since it was still possible for schools to become Outstanding, but once there it was rare to go back.

To understand the point of Ofsted, I want to explore five ways – or mechanisms – that might lead to improvements.

1. Identifying best practice

Some people argue that Ofsted has a role in identifying the highest performing schools so that other schools can learn from them.

This mechanism relies on some demanding assumptions, including that (1) we can accurately identify the best schools; (2) we can disentangle the specific practices that make these schools successful; and (3) this best practice is actually applicable to the context of other schools that might seek to imitate them.

2. Supporting parental choice

The logic here is that parents use Ofsted reports to move their children away from lower rated schools towards higher rated schools. Short-term, this gets more children into better schools. Longer-term, the less effective schools may close, while the higher-rated schools may expand.

This mechanism relies on high-quality, comparable information. Can you spot the problem? The mixed picture of reports under the old and new framework makes this a really difficult task – one that I suspect even the most engaged parents would find hard. If we think this mechanism is important, then perhaps we should invest in more frequent inspections so that parents have better information.

Personally, I’m sceptical about the potential of this mechanism. I worry about the accuracy and comparability of the reports. Also, the mechanism can only really work when pupils transition between phases: few pupils move schools midway through a phase, and those who do probably face significant downsides, such as the breaking up of friendship groups. Further, the potential of this mechanism is much more limited in rural areas, where there is less realistic choice between schools. Finally, I worry about the fairness of this mechanism – what about the pupils left behind?

Given the downgrading of many Outstanding schools I also cannot help but wonder if this mechanism might have been acting in reverse – how many pupils sent to ‘Outstanding’ schools in the past decade might have gone to a different school had it been re-inspected?

3. Identifying schools that need more support

Targeting additional resources and support where it is most needed makes a lot of sense. If we have accurate information about which schools would most benefit from support, then it is simple enough to then prioritise these schools.

Of course, for this mechanism to work, we need to correctly identify schools most in need and we need to have additional support that is genuinely useful to them.

4. Identifying areas for improvement

Ofsted’s reports identify key areas for improvement. This is potentially useful advice that schools can then focus on to improve further.

I’m sceptical about the potential of this mechanism alone because in my experience Ofsted rarely tells schools things that they do not already know.

5. Understanding the state of the nation

Ofsted have extraordinary insights into the state of the nation’s schools. Rather than supporting individual schools, this information could be used to tailor education policies by providing a vital feedback loop.

To get the most from this mechanism, it would be great to see Ofsted opening up their data for researchers to explore in a suitably anonymised manner.

Caveats and conclusions

I have not mentioned the role of Ofsted in safeguarding. Most people agree that we need a regulator doing this. But there is less consensus once focus goes from ‘food hygiene’ to ‘Michelin Guide’, to extend Amanda Spielman’s analogy.

I think it’s useful to focus on mechanisms and not just activities. It is also worth considering cost-effectiveness – are there cheaper ways of activating these mechanisms? For instance, I’ve been really impressed by how Teacher Tapp have given rich insights into the state of the nation’s schools on a tiny budget. For context, Ofsted’s annual budget is more than the £125 million given to the EEF to spend over 15 years.

Which mechanisms do you think are most promising? Are there other mechanisms? Are there better ways of achieving these mechanisms? Are there more cost-effective ways?


The ITT Market Review could be a game changer

The ITT market review has the potential to make a dramatic difference to the future of the teaching profession and in turn the life experiences of young people. I’ve previously written about how the review could succeed by removing less effective providers from the market and replacing them with better ones.

In this post, I want to examine another mechanism: programme development. The consultation published earlier this year describes a number of activities, including more training for mentors, but I want to unpick the details of the mechanisms to help clarify thinking and focus our attention on the most important details.

Four lenses

There are a number of lenses we can use to look at programme development, including:

  1. Curriculum, pedagogyª, assessment – each of these has the potential to improve the programme.
  2. What trainees will do differently – this is a useful lens because it brings us closer to thinking about trainees, rather than just activities. Relatedly, Prof Steve Higgins invites us to think about three key ways to improve pupil learning; we can improve learning by getting pupils to work harder, for longer, or more effectively or efficiently.
  3. Behaviour change – ultimately, the market review is trying to change the behaviour of people, including programme providers, partner schools and of course trainee teachers. Therefore, it is also useful to use the capability, opportunity, motivation model of behaviour change (COM-B).
  4. Ease of implementation – we need to recognise that ITT programmes have quite complex ‘delivery chains’ involving different partners. When considering the ease of implementation – and crucially scalability – it might help to consider where in the delivery chain changes in behaviour need to take place. Changes at the start of the delivery chain, such as to the core programme curriculum, are likely easier to make compared to those at the end such as changes within the placement schools.

(ªBe gone foul pedants, I’m not calling it andragogy.)

With these four lenses in hand, let’s consider how the market review might support the development of the programme.

The curriculum

The curriculum is as good a place as any to start, but first I’d like to emphasise that ITT programmes are complex – many different actors need to act in a coordinated manner – and this is perhaps felt most acutely when it comes to the curriculum. Instead of advocating the teaching of particular things, I’d like to highlight three specific mechanisms that could lead to change.

First, prioritising the most important learning. For instance, I am yet to find a trainee who would not benefit from more focused subject knowledge development. You can insert your own pet project or peeve here too.

Second, reducing the redundancy, or duplication, by cutting down on the overlap of input from different actors. For instance, in my experience, it is common for different actors to present models that are functionally similar, but different. There are lots of different models concerning how best to scaffold learning and different actors may introduce their personal favourite. Of course, there are sometimes sound reasons for presenting different models since the similarities and differences can help us to appreciate deeper structures, but where this variation is arbitrary it is just adding to the noise of an already challenging year for trainee teachers.

Third, sequencing is often the difference between a good and a great curriculum. Improved sequencing can help to optimise learning either by ensuring that trainees progressively develop their understanding and practice, or by ensuring that as trainees encounter new ideas, they also have the opportunity to apply them. The EEF’s professional development guidance highlights four mechanisms: building knowledge, motivating staff, developing teaching practice, and embedding practice. A challenge for ITT providers – given the logistics of school placements – is that there is often quite a gap between building trainee knowledge and providing opportunities – particularly involving pupils – for them to apply this knowledge.

Depending on the level of abstraction at which we think about the programme, there are different mechanisms. At the highest level, it is instructive to think about trainees working longer, harder, more effectively or efficiently. I suspect we are at the limit of what can be achieved by getting trainees to work longer hours – short of extending the programme length. The market review consultation recommends a minimum length of 38 weeks so, assuming trainees work a 40-hour week, we need to decide: what is the best way for them to spend their 1,520 hours?
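The arithmetic behind that figure can be sketched in a couple of lines of Python. The 38 weeks come from the consultation and the 40-hour week is the assumption stated above; the function name is purely illustrative.

```python
def programme_hours(weeks: int = 38, hours_per_week: int = 40) -> int:
    """Total trainee hours, assuming the same working week throughout."""
    return weeks * hours_per_week


print(programme_hours())  # 1520
```

Changing either argument shows how sensitive the total is to the assumed working week or programme length.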

Teaching methods

Turning away from the curriculum, how might we improve the effectiveness and efficiency of our teaching methods? Here are some of the areas that I would explore.

  1. Can we make it easier for trainees to access brilliant content? High-quality textbooks tightly aligned with the programme content would be a very useful and scalable resource. Having to comb through lots of different reports not tailored specifically for programmes is a real inefficiency.
  2. Do we want trainees to spend so much time engaging with primary research? It’s definitely a useful skill to develop, but the best way to be able to independently and critically engage with primary research is not to simply be thrown into it. It’s not that this is an inherently bad idea, just that it has a high opportunity cost.
  3. How do we make better use of trainees’ self-directed study? I suspect giving access to better resources – particularly for subject knowledge development – is an easy win. There may also be merit in helping to develop better study habits.
  4. Do we really need trainees to complete an independent research project? I think trainees should engage more with research, but as users, not producers. My starting point would be helping trainees to recognise different types of claims, and assessing the rigour and relevance of the supporting evidence. This is not too technical, and it is fundamental to building the research literacy of the profession. For the purists who cannot let go of the individual research project, I would point to the need for greater scaffolding – managing an entirely new research project just has too many moving parts for trainees to become proficient in any of them. One way that doing research could be scaffolded is through micro-trials similar to the EEF’s teacher-choices trials, or WhatWorked. There is a growing body of evidence from other fields that replications can be a useful teaching tool and generate useable knowledge. These do not have to be limited to trials, but could also include common data collection instruments and aggregating data. This would allow the systematic accumulation of knowledge from a large and interested workforce. The forthcoming Institute of Teaching could help to coordinate this kind of work.
  5. Can we cut down on time trainees spend on other things? Travelling between venues, waiting around, and cutting and preparing resources all seem like areas that could be optimised. The gains here might not be big, but they are probably quite easy to achieve.
  6. Can we improve the quality of mentoring? The market review consultation focuses quite a bit on mentoring. I agree that this is probably a really promising mechanism, but it is also probably really hard to do – particularly at scale.

Some of the ideas listed above are easier to do than others, and will have different impacts. When considering the ease of implementation – and crucially scalability – it might help to consider where in the delivery chain changes in behaviour need to take place. Changes at the start of the delivery chain, such as to the core programme curriculum, are likely easier to make compared to those at the end such as changes within the placement schools. Through this lens it becomes obvious that it’s considerably harder to improve the quality of mentoring provided by thousands of mentors compared to investing in providing high-quality, structured information in textbooks, for instance.


Finally, let’s examine assessment. An assessment is a process for making an inference, so what inferences do we need to make as part of an ITT programme? I think there are four types for each prospective teacher:

1. Are they suitable to join our programme?

2. What are the best next steps in their development?

3. Are they on track to achieve Qualified Teacher Status (QTS)?

4. Have they met the Teachers’ Standards to recommend QTS?

The first and final inferences are both high stakes for prospective teachers. The Teachers’ Standards are the basis for the final inference – but what is there to support the first inference? From sampling the DfE’s Apply Service, it is evident that there is quite diverse practice between providers – how might we support providers to improve the validity and fairness of these assessments? Assessment is always tricky, but it is worth stepping back to appreciate how hard the first assessment is – we are trying to make predictions about some potentially very underdeveloped capabilities. What are the best predictors of future teaching quality? How can we most effectively select for them? How do we account for the fact that some candidates have more direct experience of teaching than others?

The second and third inferences are about how we can optimise the development of each trainee, and also how we identify trainees who may need some additional support. This support might be linked directly to their teaching, or it may concern wider aspects needed to complete their programme such as their personal organisation. Getting these assessments right can help to increase the effectiveness and efficiency of the programme – and in turn the rate of each trainee’s development.

Assessment is difficult, so it would almost definitely be helpful to have some common instruments to support each of these inferences. For instance, what about some assessments of participants’ subject knowledge conducted at multiple points in the programme? These could provide a valuable external benchmark, and also be used diagnostically to support each trainee’s development. Done right, they could also be motivating. Longer-term, this could provide a valuable feedback loop for programme development. Common assessments at application could also help shift accountability of ITT providers onto the value that they add to their trainees, rather than just selecting high-potential trainees.

Final thoughts

I’ve focused on the mechanisms that might lead to improvements in ITT provision. We can think of these mechanisms with different levels of abstraction and I have offered four lenses to support this: curriculum, assessment, and pedagogy; what trainees will do differently; ease of implementation; and behaviour change.

My overriding thought is that there is certainly the potential for all ITT providers to further improve their programmes using a range of these potential mechanisms and others. However, improvement will not be easy and the DfE will need to focus on capability, opportunity, and motivation. In other words, support and time are necessary to realise some of these mechanisms. Therefore, is it worth thinking again about the proposed timescale, including what happens once providers have been reaccredited?


How could the ITT market review succeed?

The ITT Market Review aims to ensure consistently high-quality training in a more efficient and effective market. It is currently out for consultation and could dramatically reshape how we prepare teachers in England.

The main recommendation is that all providers should implement a set of new quality requirements and that an accreditation process should ensure that providers can meet these requirements.

The new quality requirements cover:

  • A greater focus on the curriculum linked to the Core Content Framework
  • The identification of placement schools and ensuring that these placements are aligned with the training curriculum
  • The identification and training of mentors, including introducing the new role of lead mentors
  • The design and use of a detailed assessment framework
  • Processes for quality assurance throughout the programme
  • The structures and partnerships needed to deliver a programme and hold partners accountable for the quality of their work
  • An expectation that courses last at least 38 weeks, with at least 28 weeks in schools

How could the review succeed?

Instead of focusing on the nature of the proposals – should courses last at least 38 weeks? Is the Core Content Framework appropriate? – I want to analyse the proposals in their own terms: are they likely to achieve their stated goals? To do this, it helps to think about potential mechanisms that could lead to improvements and potential support factors and unintended effects.

Mechanism 1: Removing less effective providers

Less effective providers could be removed from the market, and this could raise average quality. This could happen in multiple ways. First, providers might decide it is all too much and not re-apply, which is problematic if they are a strong provider. Second, some providers may merge. Third, some providers may try but fail to meet the requirements. Time will tell how many of the 240 accredited ITT providers fit into each category.

How do we accurately assess the quality of provision?

Getting this right is fundamental if removing less effective providers is a crucial mechanism for strengthening the market. However, any selection process involves a fundamental trade-off between false positives and false negatives, so we should consider how many strong providers we are willing to sacrifice for every less effective provider we remove.

The distribution of provider quality and how the assessment is done will influence the relative trade-off between sensitivity and specificity. Do we know the distribution of provider quality? My hunch is that most providers are similar, but there are long tails of stronger and weaker providers. If this is the case, do we draw the line to chop off the tail of weaker providers, or do we cut into the body of similar providers?
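The trade-off can be made concrete with a small simulation. This is only a sketch under loudly stated assumptions: provider quality is drawn from a bell-shaped distribution (my hunch above is that the real distribution has longer tails), the assessment is the true quality plus random noise, and the thresholds, noise level and 'weak' cut-off are all hypothetical. Only the figure of 240 providers comes from the text.

```python
import random


def simulate_providers(n_providers=240, noise_sd=0.5, seed=0):
    """Return (true_quality, assessed_score) pairs for hypothetical providers."""
    rng = random.Random(seed)
    providers = []
    for _ in range(n_providers):
        true_quality = rng.gauss(0, 1)  # assumed bell-shaped quality
        assessed = true_quality + rng.gauss(0, noise_sd)  # imperfect assessment
        providers.append((true_quality, assessed))
    return providers


def removal_counts(providers, threshold, weak_cutoff=-1.0):
    """Count weak providers removed and other providers removed alongside them."""
    removed = [(q, s) for q, s in providers if s < threshold]
    weak_removed = sum(1 for q, _ in removed if q < weak_cutoff)
    others_removed = len(removed) - weak_removed
    return weak_removed, others_removed


providers = simulate_providers()
for threshold in (-1.5, -1.0, -0.5):
    weak, others = removal_counts(providers, threshold)
    print(f"threshold {threshold}: {weak} weak removed, {others} others removed")
```

Raising the threshold can only catch more of the genuinely weak providers, but it also can only remove more of the rest. Where to draw the line is a judgement about which error we mind more, not something the assessment can settle for us.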

The second consideration is how to judge provider quality. The consultation offers a high-level process on page 29 involving a desk-based exercise with providers responsible for submitting evidence. But who will apply the quality requirements to the evidence submitted? Civil servants supported by some expert input? This might work well for some aspects, such as assessing quality assurance processes, but the heart of the reforms – the curriculum – is much harder to assess.

To maximise the accuracy of judgements, it makes sense to do it in phases: an initial separation of those that very clearly do or do not meet the criteria and then a more intensive stage for those that might meet the requirements. Otherwise, an appeal mechanism might be wise. Using a phased approach could improve the assessments’ accuracy while making the most of everyone’s finite resources.

While still thinking about the distribution of provider quality, it is worth asking if there is enough meaningful variation. If most providers are pretty similar, then at best, we can only make relatively minor improvements by removing the least effective providers. There might be more meaningful variation at the subject level or even at the level of individual tutors. If true, and we could accurately measure this variation, this would hint at a very different kind of market review (licences for ITT tutors, anyone? No?). For context, eight providers each developed over 500 teachers last year, and UCL almost reached 1,600 – should we look at a more granular level for these providers?

The less effective providers are gone; what next?

We now need to replace the capacity we removed by introducing new providers or expanding the remaining higher-quality providers. Removing lots of less effective providers is a promising sign the mechanism is working, but it poses a challenge: can we bring in new capacity that is – quite a bit – better? This may depend on how much we lose: is it 5, 15 or even 50 per cent?

Do we know if new providers will join? It would probably be wise to determine if this is likely – and the potential quality – before removing existing providers. The quality requirements set a high bar for new entrants, so a rush of new providers seems unlikely. That said, some Trusts and Teaching School hubs may come forwards – especially if given the grant funding for the set-up work advocated in the review. Other providers like the ECF and NPQ providers not already involved with ITT, including Ambition Institute, may consider applying.

Can we replace the capacity we remove with capacity that is – quite a lot – better?

Expanding existing strong providers seems desirable and straightforward enough, but we should heed the warnings from countless unsuccessful efforts at scaling promising ideas. Spotting barriers to scalability – before you hit them – is often tricky. Sir David Carter has observed that when Trusts grow to the point that the CEO can no longer line manage all of the headteachers, a scalability barrier has been reached: new systems and processes are needed to continue operating effectively.

What are the barriers for an ITT provider? The brilliant people and the delicate partnerships with placement schools that have often developed over several years are challenging to scale. No doubt there are many more.

Before we forget, what about those providers that merged to get through the application process? How do we ensure the best practice is embedded across their work? Again, this isn’t easy, and we will likely have to base the judgements on providers’ plans rather than the actual implementation, given the timeline. Nonetheless, it seems likely that money and time will help. An analogy is the Trust Capacity Fund that provides additional funding to expanding Trusts for focused capacity building. 


If we think that removing less effective providers is an effective mechanism for the ITT Market Review, then we should:

  • Purposefully design and implement the selection process
  • Plan for how to replace the removed capacity
  • Ensure that time and money are not undue obstacles
  • Consider phasing the approach

In part two, I explore another mechanism – programme development – that the ITT Market Review might use to achieve its goals.


Questions first, methods second

Research tries to answer questions. The range of education research questions is vast: why do some pupils truant? What is the best way to teach fractions? Which pupils are most likely to fall behind at school? Is there a link between the A-levels pupils study and their later earnings in life?

Despite the bewildering array of questions, education research questions can be put into three main groups.

  1. Description. Aims to find out what is happening, like how many teachers are there in England? What is the average KS2 SAT score in Sunderland?
  2. Association. Aims to find patterns between two or more things, like do pupils eligible for free school meals do worse at GCSE than their more affluent peers?
  3. Causation. Aims to answer if one thing causes another, like does investing in one-to-one tuition improve GCSE history outcomes?

The research question determines the method

A really boring argument is what is the best type of research. Historically, education has been plagued with debates about the merits of qualitative versus quantitative research. 

A useful mantra is questions first, methods second. Quite simply, some methods are better suited to answering some questions than others. A good attempt to communicate this comes from the Alliance for Useful Evidence’s report, ‘What Counts As Good Evidence?’

Have a go at classifying these questions into the three categories of description, association, or causation.

  1. How many teachers join the profession each year in England?
  2. What percentage of children have no breakfast?
  3. How well do children who have no breakfast do on SATS?
  4. Does running a breakfast club improve pupils’ SATS scores?
  5. How prevalent is bullying in England’s schools?
  6. Are anti-bullying interventions effective at stopping bullying?
  7. Does reading to dogs improve pupils’ reading?
  8. Is it feasible to have a snake as a class pet?
  9. Is there a link between school attendance and pupil wellbeing?
  10. Does marking work more often improve science results?

Answers: 1) descriptive 2) descriptive 3) associative 4) causal 5) descriptive 6) causal 7) causal 8) descriptive 9) associative 10) causal

Finally, if you want a fantastic guide to research questions, then Patrick White’s Developing Research Questions is an excellent read.


Effect sizes

Effect sizes are a popular way of communicating research findings. They can move beyond binary discussions about whether something ‘works’ or not and illuminate the magnitude of differences.

Famous examples of effect sizes include:

  • The Teaching and Learning Toolkit’s months’ additional progress
  • Hattie’s dials and supposed ‘hinge point’ of 0.4

Like anything, it is possible to use effect sizes more or less effectively. Still, considering these four questions will help ensure intelligent use.

What type of effect size is it?

There are two fundamentally different uses of effect sizes. One communicates information about an association; the other focuses on interventions. Confusing the two effect sizes leads to the classic statistical mistake of confusing correlation with causation.

Understanding the strength of associations, or correlations, is important. It is often the first step to learning more about phenomena. For instance, knowing that there is a strong association between parental engagement and educational achievement is illuminating. However, this association is very different from the causal claim that improving parental engagement can improve school achievement (See & Gorard, 2013). Causal effect sizes are more common in education; we will focus on them with the remaining questions.

How did the overall study influence the effect size?

It is tempting to think that effect sizes tell us something absolute about a specific intervention. They do not. A better way to think of effect sizes is as properties of the entire study. This does not make effect sizes useless, but they need more judgement to make sense of them than it may first appear.
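A minimal sketch shows why an effect size belongs to the study rather than the intervention. It assumes the effect size is a standardised mean difference (Cohen's d with a pooled standard deviation), which is one common choice rather than something specified here, and the pupil scores are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical scores: the intervention adds exactly 2 points in both studies.
control_varied = [40, 50, 60, 70, 80]
treatment_varied = [42, 52, 62, 72, 82]
control_similar = [56, 58, 60, 62, 64]
treatment_similar = [58, 60, 62, 64, 66]

print(round(cohens_d(treatment_varied, control_varied), 2))    # 0.13
print(round(cohens_d(treatment_similar, control_similar), 2))  # 0.63
```

The same two-point gain produces a much larger effect size when the sampled pupils are more alike, because the denominator shrinks. Nothing about the intervention itself has changed.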

Let’s look at the effect sizes from three EEF-funded trials (Dimova et al., 2020; Speckesser et al., 2018; Torgerson et al., 2014):

All these programmes seem compelling, and Using Self-Regulation to Improve Writing appears the best. These are the two obvious – and I think incorrect – conclusions that we might draw. These studies helpfully illustrate the importance of looking at the whole study when deciding the meaning of any effect size.

1. Some outcomes are easier to improve than others. 

The more closely aligned an outcome is to the intervention, the bigger the effects we can expect (Slavin & Madden, 2011). So we would expect a programme focusing on algebra to report larger effects for algebra than for mathematics overall. This is critical to know when appraising outcomes that have been designed by the developers of interventions. In extreme cases, assessments may focus on topics that only the intervention group have been taught!

There’s also reason to think that some subjects may be easier to improve than others. For instance, writing interventions tend to report huge effects (Graham, McKeown, Kiuhara, & Harris, 2012). Is there something about writing that makes it easier to improve?

2. If the pupils are very similar, the effects are larger.

To illuminate one reason, consider that around 13 per cent of children in the UK have undiagnosed vision difficulties (Thurston, 2014). Only those children with vision difficulties can possibly benefit from any intervention to provide glasses. If you wanted to show your intervention was effective, you would do everything possible to ensure that only children who could benefit were included in the study. Other pupils dilute the benefits.
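The dilution can be sketched with simple arithmetic. This is a deliberate simplification: it works with an average gain in raw score points rather than a standardised effect size, and the 10-point gain is hypothetical. Only the 13 per cent figure comes from the text.

```python
def diluted_gain(gain_for_beneficiaries: float, share_who_can_benefit: float) -> float:
    """Average gain across all pupils when only a share of them can benefit."""
    if not 0 <= share_who_can_benefit <= 1:
        raise ValueError("share_who_can_benefit must be between 0 and 1")
    return gain_for_beneficiaries * share_who_can_benefit


# Hypothetical 10-point gain for pupils who actually need glasses.
targeted = diluted_gain(10, 1.0)   # study includes only pupils who can benefit
universal = diluted_gain(10, 0.13) # study includes every pupil

print(round(targeted, 2), round(universal, 2))
```

Under these assumptions, the same intervention looks several times less effective in a study of all pupils than in a study restricted to those who can benefit.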

3. Effects tend to be larger with younger children.  

Young children tend to make rapid gains in their learning. I find it extraordinary how quickly young children learn to read, for example. 

A more subtle interpretation I’ve heard Professor Peter Tymms advocate is to think about how deep into a subject pupils have reached. This may explain the large effects in writing interventions. In my experience, the teaching of writing is typically much less systematic than reading. Perhaps many pupils are simply not very deep into learning to write so make rapid early gains when writing is given more focus.

4. More rigorous evaluations produce smaller effects.

A review of over 600 effect sizes found that random allocation to treatment conditions is associated with smaller effects (Cheung & Slavin, 2016). Effects also tend to be smaller when action is taken to reduce bias, like the use of independent evaluations (Wolf, Morrison, Inns, Slavin, & Risman, 2020). This is probably why most EEF-funded trials – with their exacting standards (EEF, 2017) – find smaller effects than the earlier research summarised in the Teaching and Learning Toolkit.

5. Scale matters

A frustrating finding in many research fields is that as programmes get larger, effects get smaller. One likely reason is fidelity. A fantastic music teacher who has laboured to create a new intervention is likely much better at delivering it than her colleagues. Even if she trained her colleagues, they would likely remain less skilled and motivated to make it work. Our music teacher is an example of super realisation bias that can distort small scale research studies.

Returning to our three EEF-funded studies, it becomes clear that our initial assumption that Using Self-Regulation to Improve Writing (IPEELL) was the most promising programme may be wrong. My attempt at calibrating each study against the five issues is shown below. The green arrows indicate we should consider mentally ‘raising’ the effect size. In contrast, the red arrows suggest ‘lowering’ the reported effect sizes.

This mental recalibration is imprecise, but accepting the uncertainty may be useful.

How meaningful is the difference?

Education is awash with wild claims. Lots of organisations promise their work is transformational. Perhaps it is, but the findings from rigorous evaluations suggest that most things do not make much difference. A striking fact is that just a quarter of EEF-funded trials report a positive impact.

Historically, some researchers have sought to give benchmarks to guide interpretations of studies. Although they are alluring, they’re not very helpful. A famous example is Hattie’s ‘hinge point’ of 0.4, which was the average from his Visible Learning project (Hattie, 2008). However, the included studies’ low quality inflates the average; the contrast with the more modest effect sizes from rigorous evaluations is clear-cut. If nothing else, the example highlights the absurdity of trying to compare effect sizes with universal benchmarks.

The graphic below presents multiple representations of the difference found in the Nuffield Early Language Intervention (+3 months’ additional progress) between the intervention and control groups. I created it using this fantastic resource. I recommend using it as the multiple representations and interactive format help develop a more intuitive feeling for effect sizes.
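
For readers who like to see the arithmetic, several common representations of an effect size can be computed directly from a standardised mean difference (Cohen's d). The sketch below assumes normally distributed outcomes with equal variance in both groups. The value 0.2 is a hypothetical input rather than the NELI result, and the function names are my own.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def describe_effect_size(d: float) -> dict:
    """Re-express Cohen's d, assuming normal outcomes with equal variance."""
    return {
        # Share of the intervention group scoring above the control-group mean (Cohen's U3)
        "above_control_mean": normal_cdf(d),
        # Chance that a random intervention pupil outscores a random control pupil
        "probability_of_superiority": normal_cdf(d / math.sqrt(2.0)),
        # Proportion of the two distributions that overlaps
        "overlap": 2.0 * normal_cdf(-abs(d) / 2.0),
    }


if __name__ == "__main__":
    # 0.2 is a hypothetical effect size, chosen only for illustration
    for name, value in describe_effect_size(0.2).items():
        print(f"{name}: {value:.1%}")
```

With an effect size of zero, the two distributions overlap completely and each probability is 50 per cent, which is a useful sanity check.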

How cost-effective is it?

Thinking about cost often changes what looks like the best bets. Cheap, low impact initiatives may be more cost-effective than higher impact, but more intensive, projects. An excellent example is texting parents about their children’s learning, which has a low impact but an ultra-low cost (Miller et al., 2016).

It is also vital to think through different definitions of cost. In school, time is often the most precious resource.
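
One rough way to compare options is to divide impact by cost, using more than one definition of cost. The sketch below does this for two made-up programmes. Every figure in it is hypothetical and chosen only to illustrate how a cheap, low impact option can come out ahead on both money and staff time.

```python
from dataclasses import dataclass


@dataclass
class Programme:
    name: str
    months_progress: float  # additional months of progress (hypothetical)
    cost_per_pupil: float   # pounds per pupil (hypothetical)
    staff_hours: float      # staff hours per pupil (hypothetical)


def months_per_100_pounds(p: Programme) -> float:
    """Months of additional progress per 100 pounds spent per pupil."""
    return 100 * p.months_progress / p.cost_per_pupil


def months_per_staff_hour(p: Programme) -> float:
    """Months of additional progress per hour of staff time per pupil."""
    return p.months_progress / p.staff_hours


if __name__ == "__main__":
    # Both programmes and all of their numbers are invented for illustration.
    options = [
        Programme("Low-cost messaging", 1, 5, 0.5),
        Programme("Intensive small-group support", 4, 500, 10),
    ]
    for p in options:
        print(
            f"{p.name}: {months_per_100_pounds(p):.1f} months per £100, "
            f"{months_per_staff_hour(p):.1f} months per staff hour"
        )
```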

In summary

Effect sizes are imperfect but, used well, they have much to offer. Remember to ask:

  • What type of effect size is it?
  • How did the overall study influence the effect size?
  • How meaningful is the difference?
  • How cost-effective is it?

Next steps for further reading

Kraft, M. A. (2020). Interpreting Effect Sizes of Education Interventions. Educational Researcher.

Piper, K. (2018). Scaling up good ideas is really, really hard — and we’re starting to figure out why. Retrieved January 23, 2021, from

Simpson, A. (2018). Princesses are bigger than elephants: Effect size as a category error in evidence-based education. British Educational Research Journal, 44(5), 897–913.


Cheung, A. C. K., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.

Dimova, S., Ilie, S., Brown, E. R., Broeks, M., Culora, A., & Sutherland, A. (2020). The Nuffield Early Language Intervention. London. Retrieved from

EEF. (2017). EEF standards for independent evaluation panel members. Retrieved from

Graham, S., McKeown, D., Kiuhara, S., & Harris, K. R. (2012). A meta-analysis of writing instruction for students in the elementary grades. Journal of Educational Psychology, 104(4), 879–896.

Hattie, J. (2008). Visible learning: a synthesis of over 800 meta-analyses relating to achievement. Abingdon: Routledge.

Miller, S., Davison, J., Yohanis, J., Sloan, S., Gildea, A., & Thurston, A. (2016). Texting parents: evaluation report and executive summary. London. Retrieved from

See, B. H., & Gorard, S. (2013). What do rigorous evaluations tell us about the most promising parental involvement interventions? A critical review of what works for disadvantaged children in different age groups. London: Nuffield Foundation. Retrieved from

Slavin, R. E., & Madden, N. A. (2011). Measures Inherent to Treatments in Program Effectiveness Reviews. Journal of Research on Educational Effectiveness, 4(4), 370–380.

Speckesser, S., Runge, J., Foliano, F., Bursnall, M., Hudson-Sharp, N., Rolfe, H., & Anders, J. (2018). Embedding formative assessment: evaluation report and executive summary. London. Retrieved from

Thurston, A. (2014). The Potential Impact of Undiagnosed Vision Impairment on Reading Development in the Early Years of School. International Journal of Disability, Development and Education, 61(2), 152–164.

Torgerson, D. J., Torgerson, C. J., Ainsworth, H., Buckley, H., Heaps, C., Hewitt, C., & Mitchell, N. (2014). Using self-regulation to improve writing. London. Retrieved from

Wolf, R., Morrison, J., Inns, A., Slavin, R., & Risman, K. (2020). Average Effect Sizes in Developer-Commissioned and Independent Evaluations. Journal of Research on Educational Effectiveness, 13(2), 428–447.


Invest in stopping ineffective things

Schools are hotbeds of innovation. In my role supporting schools to develop more evidence-informed practice, I always admire teachers’ creativity and dedication. However, I also see colleagues trying to do too many things, including things likely to have limited impact based on the best available evidence.

A clear message from the Education Endowment Foundation’s popular resource on putting evidence to work is that schools should do fewer things better (EEF, 2019). This includes stopping things that are less effective in order to release the capacity to do even better things. In my experience, these messages are beginning to take hold; they also feature prominently in the new national professional qualifications.

At a system level, I think we should do more to stop ineffective initiatives. The Department for Education (DfE) is increasingly good at scaling up initiatives with promise, such as the Nuffield early language intervention (NELI), which, according to multiple rigorous evaluations, has improved children’s communication and language (Dimova et al., 2020).

What about ineffective programmes?

A recent evaluation of Achievement for All’s flagship programme – used by around 10 per cent of schools in England – provides a fascinating case study (Humphrey et al., 2020). The evaluation was concerning: it found children in the control schools did considerably better than their peers in schools using the intervention. The study received the EEF’s highest security rating of five padlocks based on the randomised design, large scale, low dropout and low risk of wider threats to validity. This is on top of the EEF’s exacting standards, involving independent evaluation and pre-specifying the analysis to reduce ‘researcher degrees of freedom’ (EEF, 2017; Gehlbach & Robinson, 2018).

In short, we can be very confident in the headline: children in the Achievement for All schools made two months’ less progress in reading, on average, compared to children in schools that did not receive the programme.

What happened after the evaluation?

Not much.

The EEF (2020) published helpful guidance for schools currently using the programme, and Achievement for All published a blog (Blandford, n.d.) essentially rejecting the negative evaluation – yet many schools continue to use the programme.

The contrast is stark: when programmes are evaluated with promising results, they are expanded; when evaluations are less positive, there are limited consequences.

What if we actively stopped ineffective interventions?

If we assume that the findings from evaluations of programmes such as Achievement for All generalise to the wider population of schools already using them – a quite reasonable assumption – then paying to stop them is an excellent investment.

A bold option is to simply pay organisations to stop offering ineffective programmes – think ‘golden goodbyes’. The government, or a brave charity, could purchase the intellectual property, thank the staff for their service, provide generous redundancy payments, and concede that the organisation’s mission is best achieved by stopping a harmful intervention.

If that feels too strong, what about simply alerting the schools still using the programme and supporting them to review whether the programme is working as intended in their own school? Remember, for Achievement for All, this is around 1 in 10 of England’s schools. New adopters of ineffective programmes could be discouraged by maintaining a list of ‘not very promising projects’ to mirror the EEF’s ‘promising projects’ tool, though we may need a better name.

These ideas scratch the surface of what is possible, but I think there is a strong case for using both positive and negative findings to shape education policy and practice.

Finally, there is an ethical dimension: is it right to do so little when we have compelling evidence that certain programmes are ineffective?

This post was originally published in the BERA Blog


Blandford, S. (no date). Education Endowment Foundation Achievement for All: Years 4 and 5 Trial Programme (2016–2018)

Dimova, S., Ilie, S., Brown, E. R., Broeks, M., Culora, A., & Sutherland, A. (2020). The Nuffield early language intervention. London: Education Endowment Foundation. Retrieved from

Education Endowment Foundation [EEF]. (2017). EEF standards for independent evaluation panel members. London. Retrieved from

Education Endowment Foundation [EEF]. (2019). Putting evidence to work: A school’s guide to implementation. London. Retrieved from

Education Endowment Foundation [EEF]. (2020). The EEF’s evaluation of Achievement for All: Answers to key questions for teachers and school leaders. Retrieved from

Gehlbach, H., & Robinson, C. D. (2018). Mitigating illusory results through preregistration in education. Journal of Research on Educational Effectiveness, 11(2), 296–315.

Humphrey, N., Squires, G., Choudry, S., Byrne, E., Demkowicz, O., Troncoso, P., & Wo, L. (2020). Achievement for All: Evaluation report. London: Education Endowment Foundation. Retrieved from


Unmasking education trials

Recent weeks have seen a series of exciting announcements about the results of randomised controlled trials testing the efficacy of vaccines. Beyond the promising headlines, interviews with excited researchers have featured the phrase ‘unmasking’. But what is unmasking, and is it relevant to trials in education?

Unmasking is the stage in a trial when researchers find out whether each participant is in the control or intervention group. In healthcare, there are up to three ways that trials can be masked. First, participants may be unaware whether they are receiving the intervention; second, practitioners leading the intervention, like nurses providing a vaccination, may not know which participants are receiving the intervention; third, the researchers leading the trial and analysing the data may not know which treatment each participant receives.

Each of these forms of masking, also known as blinding, is designed to prevent known biases. If knowledge of treatment allocation changes the behaviour of stakeholders – participants, practitioners, researchers – this may be misattributed to the intervention. For instance, in a trial testing vaccinations, participants who know that they have received the real vaccine may become more reckless, which could increase their risk of infection; practitioners may provide better care to participants they know are not getting the vaccine; researchers may make choices – consciously or subconsciously – that favour their preferred outcomes.

These various risks are called social interaction threats, and each has various names. Learning the names is interesting, but I find it helpful to focus on their commonalities: they all stem from actors in the research changing their behaviour based on treatment allocation. The risk is that these can lead to apparent effects that are misattributed to the intervention.

  • Diffusion or imitation of treatment is when the control group starts doing, or at least attempts to imitate, the intervention.
  • Compensatory rivalry is when the control group puts in additional effort to ‘make up’ for not receiving the intervention.
  • Resentful demoralisation is the opposite of compensatory rivalry: the control group becomes demoralised after finding out they will miss out on the intervention.
  • Compensatory equalisation of treatment is when practitioners act favourably towards participants they perceive to be getting the less effective intervention.

So what does this all have to do with education?

It is easy to imagine how each threat could become a reality in an education trial. So does it matter that masking is extremely rare in education? Looking through trials funded by the Education Endowment Foundation, it is hard to find any that mention blinding. Further, there is limited mention in the EEF’s guidance for evaluators.

It would undoubtedly help if trials in education could be masked, but there are two main obstacles. First, there are practical barriers to masking – is it possible for a teacher to deliver a new intervention without knowing they are delivering it? Second, it could be argued that in the long list of things that need improving about trials in education, masking is pretty low down the list.

Although it is seldom possible to have complete masking in education, there are practical steps that can be taken. For instance:

  • ensuring that pre-testing happens prior to treatment allocation
  • ensuring that the marking, and ideally invigilation, of assessments is undertaken blind to treatment allocation
  • incorporating aspects of ‘mundane realism’ to minimise the threats of compensatory behaviours
  • analysing results blind to treatment allocation, and ideally guided by a pre-specified plan; some trials even have an independent statistician lead the analysis
  • actively monitoring the risk of each of these biases
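
As a rough illustration of analysing results blind to treatment allocation, the sketch below swaps the real group labels for neutral codes before any analysis takes place. The key linking codes to groups would be held by someone independent until unmasking. The function names, record layout and scores are all hypothetical, and a real trial would follow its pre-specified analysis plan rather than a simple comparison of means.

```python
import random
from statistics import mean


def mask_allocation(records, seed=None):
    """Swap real group labels for neutral codes such as 'A' and 'B'.

    Returns the masked records and the key linking each real label to its
    code. The key would be held by someone independent until unmasking.
    """
    rng = random.Random(seed)
    groups = sorted({r["group"] for r in records})
    codes = [chr(ord("A") + i) for i in range(len(groups))]
    rng.shuffle(codes)  # so that 'A' does not always mean the same arm
    key = dict(zip(groups, codes))
    masked = [{**r, "group": key[r["group"]]} for r in records]
    return masked, key


def mean_score_by_group(records):
    """Average score for each (possibly masked) group label."""
    scores = {}
    for r in records:
        scores.setdefault(r["group"], []).append(r["score"])
    return {group: mean(values) for group, values in scores.items()}


if __name__ == "__main__":
    # Hypothetical post-test scores, invented for illustration.
    records = [
        {"pupil": 1, "group": "intervention", "score": 24},
        {"pupil": 2, "group": "control", "score": 21},
        {"pupil": 3, "group": "intervention", "score": 20},
        {"pupil": 4, "group": "control", "score": 22},
    ]
    masked, key = mask_allocation(records, seed=1)
    print(mean_score_by_group(masked))  # the analyst sees only coded groups

    # Unmasking: apply the key only once the analysis is complete.
    unmask = {code: group for group, code in key.items()}
    print({unmask[c]: m for c, m in mean_score_by_group(masked).items()})
```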

I do not think we should give up all hope of masking in education. In surgery, so-called ‘sham’ operations are sometimes undertaken to prevent patients from knowing which treatment they have received. These involve little more than making an incision and then stitching it back up. It is possible to imagine adapting this approach in education.

We should also think carefully about masking on a case-by-case basis as some trials are likely at greater risk of social threats to validity than others. For instance, trials where control and intervention participants are based in the same school, or network of schools, are likely at the greatest risk of these threats.

In conclusion, a lack of masking is not a fatal blow to trials in education. We should also avoid thinking of masking as an all or nothing event. As Torgerson and Torgerson argue, there are different ways that masking can be undertaken. Taking a pragmatic approach where we (1) mask where possible, (2) consider the risks inherent in each trial and (3) closely monitor for threats when we cannot mask is probably a good enough solution. At least for now.


A shared language for great teaching

The best available evidence indicates that great teaching is the most important lever schools have to improve outcomes for their pupils. That is a key message from the Education Endowment Foundation’s guide to supporting school planning.

Assuming this is true, the next question is what – exactly – is great teaching?

My experience is that many teachers and school leaders struggle to articulate this. Channelling Justice Potter Stewart, one headteacher recently quipped to me that he ‘knew it when he saw it’. To create a great school, underpinned by great teaching, I do not think it is enough to know it when you see it. I think it is critical to have a shared language to discuss and think about great teaching. For instance, it can facilitate precise, purposeful discussion about teaching, including setting appropriate targets for teacher development.

Developing a unified theory of great teaching is extremely ambitious. Maybe too ambitious. Increasingly, I think it likely matters more that there is an explicit and ideally coherent model, rather than exactly which model is adopted. Here are five candidates that schools may want to consider.

1. What makes great teaching?

First up, what better place to start than the hugely influential report from the Sutton Trust? The report pinpointed six components of great teaching and is still an excellent read, even though it lacks the granularity of some of the others on my list.

2. The Early Career Framework

The Early Career Framework provides a detailed series of statements to guide the development of teachers in the first couple of years of their career. The statements are aligned to the Teachers’ Standards and draw on the evidence base as well as experienced teachers’ professional understanding.

Although the framework is designed around supporting teachers in the early stages of their career, it could no doubt be used more widely.

3. The Great Teaching Toolkit

The Great Teaching Toolkit draws on various evidence sources to present a four factor model of great teaching. There are many things to like in this slick report, including the pithy phrases. As a simple, yet powerful organising framework for thinking about great teaching, I really like this model.

4. Teach Like A Champion

Teach Like A Champion is never short of both critics and zealots. Personally, I like the codification of things that many experienced teachers do effortlessly. It is far from a comprehensive model of great teaching, but in the right hands it is likely a powerful tool for teacher development.

5. WalkThrus

Mirroring Teach Like A Champion’s codification of promising approaches is the WalkThrus series. The granularity of these resources is impressive, and they are again likely a powerful way of focusing teacher development.

These models each have their strengths and their weaknesses. However, I’m attracted to having an explicit model of great teaching as the basis for rich professional discussion.


Five websites to understand schools

Schools are fascinating places and everyone has opinions about them.

Unfortunately, we are often blinkered by our own relatively narrow experience as pupils, parents, teachers or concerned citizens.

To fully engage with discussions – without being overly influenced by individual institutions – we need to calibrate our intuitions. We can do this using the rich data available about English schools.

Here are five of my go-to places to calibrate my intuitions.

1. Statistics at DfE

The Department for Education publishes official statistics on education and children. The upside is they hold large amounts of data; this is also the downside.

Making sense of the DfE’s statistics can be fiendishly difficult, but there are two portals that can help:

These DfE sites are useful, but they are not fun.

2. FFT Education Datalab

FFT Education Datalab produces independent, cutting-edge research on education policy and practice.

They are especially good at making sense of the complex array of data held by the DfE and generating fascinating insights. Their blog is always worth a read. If you have never seen it, stop reading this blog and read theirs.

3. SchoolDash

SchoolDash helps people and organisations to understand schools through data.

I particularly like their interactive maps and their fascinating analysis of the teacher jobs market. Exploring their various tools is an excellent way to calibrate your intuitions about schools.

4. Watchsted

Ofsted visit lots of schools and are potentially a very rich data source to understand the nation’s schools (putting concerns about validity to one side).

Watchsted aims to help spot patterns in the data. Personally, I like looking at the word clouds that you can create.

5. Teacher Tapp

Every day, Teacher Tapp asks thousands of teachers three questions via an app. These range from the mundane to the bizarre. But – together – they generate fascinating insights into the reality of schools. They also provide a regular, light touch form of professional development.

For me, two features stand out. First, how quickly Teacher Tapp can respond to pressing issues. Their insights during COVID-19 have been remarkable.

Second is the potential to understand trends over time. Their weekly blog often picks up these trends, and no doubt these will become even more fascinating over time.

How do you calibrate your intuitions about schools?


A manifesto for better education products

It seems a lifetime ago that the BBC was pressured to remove its free content for schools by organisations intent on profiting from the increased demand for online learning materials.

As schools and a sense of normality return, however, the likelihood is that schools will continue to need to work digitally – albeit more sporadically. It is therefore crucial that we fix our systems now to avoid a messy second wave of the distance learning free-for-all.

Here’s my fantasy about how we could do things differently so that the interests of everyone involved are better aligned. Let’s be clear, there are real challenges, but change is possible.

To make my point, I want to start with the low hanging fruit of Computer Assisted Instruction – think Seneca or Teach Your Monster to Read.

Computer Assisted Instruction is widely used. There is an abundance of platforms – each with their strengths – but none of them ever quite satisfies me, and using multiple platforms is impractical because pupils spend more time logging in than learning.

Using three guiding principles, I think we could have a better system.

Principle 1: fund development, not delivery

Who do you think works for organisations offering Computer Assisted Instruction? That’s right, salespeople.

Curriculum development is slow, technical work so developers face high initial costs. As they only make money once they have a product, the race is on to create one and then flog it – hence the salespeople.

The cost of one additional school using the programme is negligible. Therefore, we could properly fund the slow development and then make the programmes free at the point of use.

Principle 2: open source materials

Here’s a secret: if you look at curriculum mapped resources and drill down to the detail, it’s often uninspiring because – as we have already seen – rational developers create a minimum viable product before hiring an ace sales team.

Our second principle is that anything developed has to be made open source – in a common format, freely available – so that good features can be adopted and improved.

This approach could harness the professional generosity and expertise that currently exists in online subject communities, like CogSciSci and Team English.

Principle 3: try before we buy

Most things do not work as well as their producers claim – shocking, I know. When the EEF tests programmes, only around a quarter are better than what schools already do.

Our third principle is that once new features, like a question set, have been developed, they have to be tested and shown to be better than what already exists before they are rolled out.

By designing computer assisted instruction systems on a common framework – and potentially linking them to other data sources – we can check to see if each feature works as intended.

A worked example

Bringing this all together, we end up with a system that better focuses everyone’s efforts on improving learning, while also likely delivering financial savings.

It starts with someone having a new idea about how to improve learning – perhaps a new way of teaching about cells in Year 7 biology. They receive some initial funding to develop the resource to its fullest potential. The materials are developed using a common format and made freely available for other prospective developers to build on.

Before the new materials are released to students, they are tested using a randomised controlled trial. Only the features that actually work are then released to all students. Over time, the programme gets better and better.
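
To make the testing step concrete, here is a minimal sketch of random allocation followed by a comparison of average scores. It assumes pupils can be individually randomised within the platform. The function names and scores are hypothetical, and a real trial would also need a pre-specified analysis and an estimate of uncertainty before deciding whether to release a feature.

```python
import random
from statistics import mean


def randomly_allocate(pupil_ids, seed=None):
    """Shuffle pupils and split them into two arms of (near) equal size."""
    rng = random.Random(seed)
    shuffled = list(pupil_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"new_feature": shuffled[:half], "existing": shuffled[half:]}


def difference_in_means(scores, allocation):
    """Mean score with the new feature minus mean score with existing materials."""
    new = mean([scores[p] for p in allocation["new_feature"]])
    old = mean([scores[p] for p in allocation["existing"]])
    return new - old


if __name__ == "__main__":
    allocation = randomly_allocate(range(1, 9), seed=2021)
    # Hypothetical end-of-unit quiz scores, invented for illustration.
    scores = {1: 12, 2: 15, 3: 9, 4: 14, 5: 11, 6: 13, 7: 10, 8: 16}
    print(allocation)
    print(difference_in_means(scores, allocation))
```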

The specifics of this approach can be adjusted to taste. For instance, we could pay by results; ministers could run funding rounds linked to their priorities, like early reading; we could still allow choice between a range of high quality programmes.

Further, the ‘features’ need not be restricted to question sets; a feature could be a new algorithm that optimises the spacing and interleaving of questions; a dashboard that provides useful insights for teachers to use; a system to motivate pupils to study for longer.

This approach allows the best elements of different programmes to be combined, rather than locking schools into a single product ecosystem.

I think we can do better than the current system – but are we willing to think differently?