Evidence use

Evidence and Timetables

We’re approaching that time of year when attention turns to timetables.

Deciding what to prioritise involves integrating our values with considerations of effectiveness and logistical constraints. Effectiveness, however, could mean multiple things: are we maximising attainment, minimising workload or simply ensuring no one is too unhappy? 

Writing for Tes, I’ve highlighted three insights that I think we can discern from the evidence base.

Evidence generation

The DfE’s Evaluation Strategy

This week the DfE published an evaluation strategy. To my knowledge, this is the first time the department has published one. I think they should be applauded for taking evaluation increasingly seriously, but I want to offer some unsolicited feedback.

Demonstrating impact

The webpage hosting the report describes ‘how the Department for Education will work towards using robust evaluation practices to demonstrate impact and build our evidence base’.

‘Demonstrating impact’ presupposes that there is any impact, yet the consistent finding from evaluations is that it is rare to see programmes that genuinely do make an impact. The sociologist Peter Rossi described this as the iron law of evaluation: the expected value of any net impact assessment of any large scale social program is zero.

The reason ‘demonstrating impact’ concerns me is that it exposes a genuine misunderstanding about the purpose of evaluation. Evaluation can be highly technical, but understanding the purposes of evaluation should be possible for everyone.

The purposes of evaluation

I think evaluation has two purposes for a government organisation. First, there is an aspect of accountability. All politicians make claims about what they will do and evaluation is an important means of holding them to account and improving the quality of public debate.

I think of this like we are letting politicians drive the car. They get to decide where they go and how they drive, but we need some objective measures of their success – did they take us where they said they would? How much fuel did they use? Did they scratch the car? Evaluation can help us decide if we want to let politicians drive the car again.

The second purpose of evaluation is to build useable knowledge for the future. I am much more interested in this purpose. Continuing our analogy, this includes things like learning where the potholes are and what the traffic conditions are like so that we can make better choices in the future.

It is not possible to evaluate everything, so the DfE need to prioritise. The strategy explains that it will prioritise activities that are:

  1. High cost
  2. Novel, or not based on a strong evidence-base
  3. High risk

I completely understand why these areas were chosen, but I think these criteria are really dumb. I think they will misallocate evaluation effort and fail to serve the purposes of democratic accountability or building useable knowledge.

If we wanted to prioritise democratic accountability, we would focus evaluations on areas where governments had made big claims. Manifestos and high-profile policies would likely be our starting point.

If we wanted to build useable knowledge, we might instead focus on criteria like:

  1. Is it a topic where evidence would change opinions?
  2. What is the scale of the policy?
  3. How likely is the policy to be repeated?
  4. How good an evaluation is it likely that we can achieve?
  5. Is there genuine uncertainty about aspects of the programme?

Leading an evaluative culture

The foreword by Permanent Secretary Susan Acland-Hood is encouraging as it suggests a greater emphasis on evaluation. It also makes the non-specific but encouraging commitment that ‘this document is a statement of intent by myself and my leadership team to take an active role in reinforcing our culture of robust evaluation. To achieve our ambition, we will commit to work closely with our partners and stakeholders’.

The foreword also notes the need for a more evaluative culture across Whitehall. I think a starting point for this is to clarify the purposes of evaluation, and ‘demonstrating impact’ is not the correct answer.

A gentle way of creating an evaluative culture might be to increasingly introduce evaluations within programmes, like the recently published ‘nimble evaluations’ as part of the National Tutoring Programme. These approaches help to optimise programmes without the existential threat of finding they did not work.

Another way of creating an evaluation culture would be to focus more on the research questions we want to answer and offer incentives for genuinely answering them.


How many pupils does Ofsted guide to the ‘wrong’ school?

The problem

I have been thinking about the exemption from routine Ofsted inspections introduced for Outstanding schools in 2014.

Inevitably, some of these schools are no longer Outstanding, but we do not know which ones so some families choose the ‘wrong’ school. How many pupils have been affected?

Before I get into the detail, I want to confess my sympathy for Ofsted as an organisation and their staff. I am not ideological about inspection, and I dislike that Ofsted is often seen as the ‘baddies’. However, I am sceptical that inspections add much value to the school system or are cost-effective. For me, this is empirical, not political.


We can get our bearings by looking at how many pupils go to schools split by their inspection grade and the year the grades were awarded. I have excluded the lowest grades for clarity.

So, a little under 1.3 million pupils attend Outstanding rated schools, which on average were inspected in 2013, but some go back to 2006.

Note that I am using data that I downloaded from Get Information About Schools a couple of months ago. This data also takes a while to filter in from Ofsted, but the most recent inspections are irrelevant given the assumptions I explain later.

Estimating the size of the problem

This will be a rough estimate, so I want to be transparent and show my working. You’re welcome to offer a better estimate by changing some of my assumptions, adding more complexity or correcting mistakes – although I hope I have avoided mistakes!

I’m interested in two related questions:

1. How many pupils have joined schools that were not actually Outstanding?

2. How many pupils have joined schools that were not actually Outstanding but chose those schools because of Ofsted’s guidance?

1. How many pupils have joined schools that were not actually Outstanding?

We need to start by recognising that this is a long-term issue, so we can’t just look at the pupils in school today: we need to estimate the number that has passed through the school. To do this, I will assume that the cohort size in each school has remained constant.

The table below shows the number of pupils by the year their Outstanding grade was awarded. I have estimated the number in each year group as one-sixth of the total. I have then calculated the number of year groups affected.

So far, I don’t think anyone would disagree much with these approximations, although you could give more precise estimates.

Next, I need to make two assumptions concerning:

  1. How long a school remains Outstanding
  2. After this period, the proportion that is no longer Outstanding

I want my estimate to be conservative, so I will say that schools rated Outstanding remain Outstanding for five years. After that, half are no longer actually Outstanding. This second estimate is a bit of a guess, but it mirrors an estimate by Amanda Spielman.

So, if we put these two numbers into our simple model, we get the following table, which estimates that 280,000 pupils have joined schools that were not actually Outstanding.

Even if we make very conservative assumptions, this issue still affects a lot of pupils: it’s the classic situation where multiplying a big number by a small number still gives quite a big number.

Suppose all schools remain Outstanding for seven years, and after that, just 25% are no longer Outstanding; this still affects 75,000 pupils.

2. How many pupils have joined schools that were not actually Outstanding but chose those schools because of Ofsted’s guidance?

To answer this question, we need to multiply our answer to question 1 by the proportion of pupils who have gone to a different school based on the Outstanding rating. My best estimate of this is 10%, which would mean around 28,000 pupils.

My estimate is based on multiple sources, including comparing differences in the ratio of pupils to capacity between Good and Outstanding schools. This is the estimate that I am least confident about, though. Note that this is an average estimate for the population: individuals will vary a lot based on their values, priorities and their local context – especially their available alternative options.
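For anyone who wants to change the assumptions, here is a minimal Python sketch of the simple model described above. It is my reading of the steps in the text, not the spreadsheet behind the figures: the function names, the `current_year` parameter and the pupil counts in `example` are all made up for illustration, and the exact way the number of affected year groups is counted is an assumption.

```python
def pupils_in_lapsed_schools(pupils_by_award_year, current_year,
                             years_outstanding=5, share_lapsed=0.5,
                             year_groups_per_school=6):
    """Question 1: rough count of pupils who joined schools no longer Outstanding.

    pupils_by_award_year maps the year a grade was awarded to the number of
    pupils currently on roll in schools graded that year.
    """
    total = 0.0
    for award_year, pupils in pupils_by_award_year.items():
        # Assume a constant cohort size: one-sixth of the pupils on roll.
        cohort = pupils / year_groups_per_school
        # Year groups that joined after the grade is assumed to have expired
        # (assumption: one new year group joins per year).
        year_groups_affected = max(0, current_year - award_year - years_outstanding)
        # Only a share of schools are assumed to have slipped after that point.
        total += cohort * year_groups_affected * share_lapsed
    return total


def pupils_swayed_by_grade(pupils_affected, share_swayed=0.10):
    """Question 2: pupils who chose the school because of the grade."""
    return pupils_affected * share_swayed


# Illustrative pupil counts only, NOT the Get Information About Schools data.
example = {2008: 60_000, 2012: 120_000, 2018: 90_000}
q1 = pupils_in_lapsed_schools(example, current_year=2022)
q2 = pupils_swayed_by_grade(q1)
print(round(q1), round(q2))
```

Swapping in `years_outstanding=7` and `share_lapsed=0.25` gives the more conservative scenario, and `share_swayed` is the 10% figure I am least confident about.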

So what?

First, I’m very open to better estimates of the magnitude of this issue, but I think this issue is an issue. Again, this is empirical, not political.

Second, defenders of the Outstanding exemption policy might reasonably argue that it refocused inspection by allowing more frequent inspections of poorly rated schools. The trouble with this argument is that Ofsted has never generated rigorous evidence that inspections aid school improvement. This would be a fairly simple evaluation if there was the will to do it – you could simply randomly allocate the frequency of inspections – but it is easy to understand why an organisation would be unwilling to take the risk.

Third, this issue is not over. Today, tomorrow, and next year, families will choose schools for their children based on Ofsted results, and some will be misled, even with the accelerated timeline for post-pandemic inspections, and that is before we consider the myriad other challenges to making valid inferences. There is also a risk of the reverse happening: families being sceptical about older Outstanding grades and placing less weight on them in their decision-making.

Fourth, if Ofsted had a theory of change that set out their activities, the outcomes they hope to achieve – and avoid – and the specific mechanisms that might lead to these outcomes, we could have more grown-up conversations about inspection. To judge Ofsted’s impact and implementation, we need to understand their exact intent.

Finally, as part of the promised review of education, we should think hard about these kinds of issues and how to minimise their impact.

Evidence use

How comparable are Ofsted grades?

What should Ofsted do, and how should they do it?

It’s a perennial question in education. An answer that will often come up in these conversations is that Ofsted gives vital information to parents about the quality of schools.

I am sceptical that Ofsted currently does this very well for two reasons:

  1. It’s tricky comparing grades between years – particularly as the nature of inspections shifts even within a year
  2. It’s not possible to compare inspections between different frameworks

How big is the issue?

Armed with my scepticism, I explored these issues by comparing secondary schools within every parliamentary constituency. I chose constituencies as the unit of analysis since they are a reasonable approximation of the schools a family chooses between. Constituencies have a median of 6 secondary schools (interquartile range: 5-7; range: 2-15).

I turned my two concerns into three binary questions that could be answered for each constituency:

  1. Are there the same grades but from different years?
  2. Are there the same grades but using different frameworks?
  3. Is there an Outstanding on the old framework and a Good on the new framework?

I found that the first barrier affects nine out of 10 constituencies. Two-thirds of constituencies are affected by the second barrier; one-third are affected by the final barrier.
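To make the three questions concrete, here is a rough Python sketch of how they could be checked for each constituency. It is not the analysis code behind the proportions above: the `Inspection` record, the "old"/"new" framework labels and the two toy areas are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Inspection(NamedTuple):
    constituency: str
    grade: str
    year: int
    framework: str  # hypothetical labels: "old" or "new"


def barriers(inspections):
    """Answer the three yes/no questions for one constituency's schools."""
    years = defaultdict(set)
    frameworks = defaultdict(set)
    grade_framework_pairs = set()
    for school in inspections:
        years[school.grade].add(school.year)
        frameworks[school.grade].add(school.framework)
        grade_framework_pairs.add((school.grade, school.framework))
    return {
        "same_grade_different_years": any(len(y) > 1 for y in years.values()),
        "same_grade_different_frameworks": any(len(f) > 1 for f in frameworks.values()),
        "old_outstanding_and_new_good": ("Outstanding", "old") in grade_framework_pairs
        and ("Good", "new") in grade_framework_pairs,
    }


def share_affected(inspections):
    """Proportion of constituencies affected by each barrier."""
    by_constituency = defaultdict(list)
    for school in inspections:
        by_constituency[school.constituency].append(school)
    flags = [barriers(schools) for schools in by_constituency.values()]
    return {name: sum(f[name] for f in flags) / len(flags) for name in flags[0]}


# Made-up data for two imaginary areas.
toy = [
    Inspection("Area A", "Outstanding", 2012, "old"),
    Inspection("Area A", "Good", 2022, "new"),
    Inspection("Area A", "Good", 2019, "old"),
    Inspection("Area B", "Good", 2022, "new"),
    Inspection("Area B", "Good", 2022, "new"),
    Inspection("Area B", "Requires improvement", 2022, "new"),
]
print(share_affected(toy))
```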

Some examples

Let’s look at some examples. Bermondsey and Old Southwark has nine secondary schools – which one is the best? One of the four Outstanding schools, right? Except only half of the previously exempt Outstanding schools have retained their grades so far.

The only inference I would feel confident with is that it looks like any of the choices will be quite a good one – which is fortunate – but it’s hard to argue that Ofsted is doing an excellent job for the families of Bermondsey and Old Southwark.

[Table: Bermondsey and Old Southwark secondary schools. Columns: School, Year of inspection, Framework, Grade]

Let’s look at Beverley and Holderness. It’s quite the contrast: the schools have been inspected using the same framework and within a year of each other, except for School F, which has no grade. This looks like good work by Ofsted: clear, actionable information.

[Table: Beverley and Holderness secondary schools. Columns: School, Year of inspection, Framework, Grade]

So what?

Ofsted’s role looks like it might be reformed in the next few years, as signalled in the recent White Paper’s promised review of regulations, and the new HMCI arriving next year will have their own views. Geoff Barton has pre-empted this debate with some interesting thoughts on removing grades.

I’ve previously criticised Ofsted for not having a published theory of change articulating how their work achieves their desired goals while mitigating the apparent risks. Ofsted do acknowledge their responsibility to mitigate these risks in their recently published strategy.

If Ofsted had a clear theory of change, then informing the parental choice of schools would very likely be part of it. The information presented here suggests that Ofsted are not currently doing a great job of this. In some ways, these issues are inevitable given that around 800 secondaries have an inspection on the new framework, 1,800 have one on the old framework, and 600 do not have a published grade.

However, if Ofsted conducted focused, area-based inspections, they would effectively mitigate the issues of different years and different frameworks for more families. These inspections would involve inspecting all the schools within an area within a similar timeframe. This would enable more like-with-like comparisons, as is currently the case in Beverley and Holderness. There would always be some boundary cases, but it would make it more explicit for families that they are not comparing like-with-like.

It is still possible to combine this approach with targeted individual inspections of schools based on perceived risks. No doubt this approach would come with some downsides. But if we are serious about giving parents meaningful information to compare schools, why do we inspect schools the way that we do?

A bonus of this approach is that you could evaluate the impact of inspections using a stepped-wedge design – a fancy RCT – where the order that areas are inspected in is randomised.
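As a rough illustration of the randomisation step in a stepped-wedge design, the Python sketch below shuffles a list of areas and deals them into inspection waves. The number of waves, the area names and the function name are my assumptions; a real design would also need to think about timing, outcomes and analysis.

```python
import random


def stepped_wedge_schedule(areas, n_steps, seed=None):
    """Randomise the order of areas, then deal them into n_steps waves."""
    rng = random.Random(seed)
    order = list(areas)
    rng.shuffle(order)
    # Slicing with a stride keeps the waves as equal in size as possible.
    return {step: order[step::n_steps] for step in range(n_steps)}


# Hypothetical area names.
areas = [f"area_{i}" for i in range(10)]
for step, wave in stepped_wedge_schedule(areas, n_steps=4, seed=1).items():
    print(step, wave)
```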


Do Ofsted grades actually influence choices? Yes. Ofsted themselves are always keen to highlight this. There is also a clear association between grades and school popularity, which we can approximate by calculating the ratio of the number on roll to school capacity. A higher ratio means that the school is more popular. The trend is clear across primary and secondary.

Ofsted grade          Primary  Secondary
Requires improvement  0.85     0.81
Serious weaknesses    0.79     0.86
Special measures      0.79     0.74
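The popularity ratio is simple to compute. The sketch below shows one way to do it in Python, pooling pupils and places within each grade. Whether to pool like this or to average each school's own ratio is a choice, and I am assuming pooling here. The school figures are invented.

```python
from collections import defaultdict


def popularity_by_grade(schools):
    """schools: iterable of (grade, number_on_roll, capacity) tuples.

    Returns the ratio of pupils on roll to capacity for each grade.
    """
    on_roll = defaultdict(int)
    capacity = defaultdict(int)
    for grade, roll, places in schools:
        if places > 0:  # skip schools with no recorded capacity
            on_roll[grade] += roll
            capacity[grade] += places
    return {grade: on_roll[grade] / capacity[grade] for grade in on_roll}


# Invented schools, for illustration only.
toy_schools = [
    ("Good", 900, 1000),
    ("Good", 450, 500),
    ("Requires improvement", 400, 500),
]
print(popularity_by_grade(toy_schools))
```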
Evidence use

The apple pie stage of evidence-informed practice

The unstoppable growth of evidence

In 1999, a young Dr Robert Coe published an excitingly titled Manifesto for Evidence Based Education. You can still read it; it’s very good. The updated version from 2019 is even better.

With deep foresight, Rob argued that:

‘“Evidence-based” is the latest buzz-word in education. Before long, everything fashionable, desirable and Good will be “evidence-based”. We will have Evidence-Based Policy and Evidence-Based Teaching, Evidence-Based Training – who knows, maybe even Evidence-Based Inspection…evidence, like motherhood and apple pie, is in danger of being all things to all people’.

Professor Rob Coe

It’s safe to say that we have now reached the apple pie stage of evidence in education: the recent White Paper promises evidence-based policy and Ofsted’s new strategy offers evidence-based inspection.

If you’re reading this, then, like me, you’re among the converted when it comes to the promise of evidence, but my faith is increasingly being tested. For a long time, I thought any move towards more evidence use was unquestionably good, but I now wonder if more evidence use is always a good thing.

To see why, look at this sketch of what I think are plausible benefit-harm ratios for five different levels of evidence use.

Level 1: no use

This is what happens when teachers use their intuition and so on to make decisions. This provides a baseline against which we can judge the other levels of evidence use.

Level 2: superficial use

I think in some ways, this is what Rob foresaw. For me, this stage is characterised by finding evidence to justify, rather than inform, decisions. The classic example of this is finding evidence to justify Pupil Premium spending long after the decision has been made.

I think this is a fairly harmless activity, and the only real downside is that it wastes time, which is why there is a slightly increased risk of an overall harm. Equally, I think it’s plausible that superficial use might focus our attention on more promising things, which could also increase the likelihood of a net benefit.

Level 3: emerging use

For me, this is where it starts to get risky. I also dare say that if you’re reading this, there’s a good chance that you fall into this category – at least for some things you do. So why do I think there’s such a risk of a net harm? Here are three reasons:

  1. Engaging with evidence is time consuming so there might be more fruitful ways of spending our time.
  2. We might misinterpret the evidence and apply it badly. There’s decent evidence that retrieval practice is beneficial, but some teachers and indeed whole schools have used this evidence to justify spending significant portions of lessons retrieving prior learning, which everything I know about teaching tells me is probably not helpful. This is an example of what some people have called a lethal mutation.
  3. There’s also a risk of overconfident decision-making. If we think that ‘the evidence says’ we should take a particular approach, then there is a risk that we keep going at it despite the signals that it’s not having the benefit we hope for.

Of course, even emerging evidence use may be immensely beneficial. I think there are three basic mechanisms by which evidence can help us to be more effective:

  1. Deciding what to do – for instance, the EEF’s tiered model guides schools towards focusing on quality teaching, targeted interventions and wider strategies with the most effort going into quality teaching.
  2. Deciding what to do exactly – what is formative assessment exactly? This is a question I routinely ask teachers and the answers are often quite vague. Evidence can help us define quality.
  3. Deciding how to do things – a key insight from the EEF’s work is that both the what and the how matter. Effective implementation and professional development can be immensely valuable.

The interplay of these different mechanisms, and other factors, will determine for any single decision whether the net impact of evidence use is beneficial or harmful.

Level 4: developing use

At this level, we’re likely spending even more time engaging with evidence. But we’re also likely reaping more rewards.

I think the potential for dramatic negative impacts starts to be mitigated by better evaluation. At level 3, we were relying on ‘best bets’, but we had little idea of whether it was actually working in our setting. Although imperfect, some local evaluation protects us from larger net harms.

Level 5: sophisticated use

Wow – we have become one with research evidence. At this stage, we become increasingly effective at maximising the beneficial mechanisms outlined in level 3, but we do this with far greater quality.

Crucially, the quality of local evaluation is even better, which almost completely protects us from net harm – particularly over the medium to long term. Also, at this stage, the benefits arguably become cumulative, meaning that things get better and better over time – how marvellous!

So what?

The categories I’ve outlined are very rough and you will also notice that I have cunningly avoided offering a full definition of what I even mean by evidence-informed practice.

I think there are lots of implications of these different levels of evidence use, but I’ll save them for another day. What do you think? Is any evidence use a good thing? Am I being a needless gatekeeper about evidence use? Do these different levels have implications for how we should use evidence?

A version of this blog was originally published on the Research Schools Network site.

Evidence use

UCL trains as many teachers as the smallest 57 providers

Last year, I wrote two pieces about the potential of the ITT market review. I outlined two ways that the review could be successful in its own terms:

  1. Less effective providers would be removed from the market and replaced by others that are better – either new entrants to the market or existing stronger providers expanding.
  2. All providers make substantial improvements to their programme so that they achieve fidelity to the Core Content Framework and the like.

This week, we found out that 80 out of 212 applicants in the first round were successful. Schools Week described this as ‘savage’, although a piece in the Tes suggests that many providers missed out due to minor issues or technicalities such as exceeding the word limit.

A lot of stories led with the 80 successful providers figure and highlighted that this risks creating a situation where there are not enough providers. The graph below shows the number of teachers each provider trains – each bar is a different provider and the grey line is the cumulative percentage. An obvious takeaway is that providers vary massively by size.

One way to think about this is to look at the extremes. In 2021, there were 233 providers, and if we split these up into thirds:

  • The largest providers trained 27,000
  • The middle providers trained 5,000
  • The smallest providers trained 2,500

So instead of asking what proportion of providers got through, a more useful question might be what proportion of the capacity got through?
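Here is a small Python sketch of the two calculations: splitting providers into thirds by size, and asking what share of training capacity, rather than what share of providers, got through. The provider names, sizes and accreditation outcomes are invented, since I am not reproducing the real provider-level data here.

```python
def trainees_by_third(provider_sizes):
    """Sort providers by size and total the trainees in each third."""
    sizes = sorted(provider_sizes)
    n = len(sizes)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "smallest": sum(sizes[:cut1]),
        "middle": sum(sizes[cut1:cut2]),
        "largest": sum(sizes[cut2:]),
    }


def share_of_capacity(sizes_by_provider, accredited):
    """Share of all trainees who are trained by accredited providers."""
    total = sum(sizes_by_provider.values())
    through = sum(size for name, size in sizes_by_provider.items()
                  if name in accredited)
    return through / total


# Invented providers: two of six are accredited (33% of providers).
sizes = {"A": 1000, "B": 200, "C": 150, "D": 50, "E": 40, "F": 10}
print(trainees_by_third(sizes.values()))
print(f"{share_of_capacity(sizes, {'A', 'D'}):.0%} of capacity accredited")
```

In this toy example, a third of providers getting through corresponds to a much larger share of capacity, simply because one of the two is very large.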

We can look at the same data with a tree map, only this time the shading highlights the type of provider. The universities are shown in light grey, while SCITTs are shown in darker grey. I’ve chosen to highlight this because while the DfE are clear that they are neutral on this matter, if you do consider size, the providers split fairly well into two camps.

So what?

This issue shows that it’s worth looking beyond the average.

I also think this dramatic variation in provider size suggests that maybe we haven’t got the process right. At the extreme end, the largest provider, UCL, trains the same number of teachers as the smallest 57 providers. I suspect that there is dramatic variation within the providers – should we try to factor this into the process?

Are we managing risks and allocating scrutiny rationally with a single approach that does not factor in the size of the organisation? Should we review different things depending on the scale at which organisations work, since new risks, opportunities and challenges arise with scale?

What else?

Averages are alluring, but headteachers will rightly care about what is going on in their area. I’m aware of a lot of anecdotal evidence and more systematic evidence that teachers tend not to move around a lot after their initial teacher training – I think this may be particularly true in the north east.

After the second round, there will be some cold spots. After all, there are already… Thinking through how to address this will be critical.

Evidence use

Book review: The Voltage Effect

Writing for Schools Week, I reviewed John A. List’s new book The Voltage Effect.

What connects Jamie’s Italian, an early education centre and Uber? Economist John A. List argues persuasively that the answer is scale, and that it is an issue relevant to everyone.

Scale permeates education, yet is rarely considered systematically. And this means some things fail for predictable reasons. For instance, there is really promising evidence about tuition, but the challenges of scaling this rapidly have proved considerable. At the simplest level, we scale things when we extend them; this might happen across a key stage, subject, whole school, or group of schools.

Scalability is crucial to national conversations such as the current focus on teacher development, Oak National Academy and the expansion of Nuffield Early Language Intervention – now used by two-thirds of primary schools. Sadly, List explains that things usually get worse as they scale: a voltage drop.

Evidence generation

The logic of randomisation

Randomised controlled trials (RCTs) are an important tool for evidence-informed practice. At times, I think that trials are both brilliant and simple; at other times, I find them frustrating, complicated, and even useless. Quality professional dialogue about evidence in education requires a firm grasp of the principles of trials, so I’m going to attempt to explain them.

In education, we have lots of different questions. We can classify these into descriptive, associative, and causal questions. Randomised controlled trials are suited to answering this final type of question. These kinds of questions generally boil down to ‘what works’. For instance, does this professional development programme improve pupil outcomes?

The fundamental problem of causal inference

To answer causal questions, we come up against the fundamental problem of causal inference whereby for every decision we make, we can only experience what happens if we do A or B: we cannot experience them both and compare them directly.

Suppose we have a single strawberry plant and we want to know if Super Fertiliser 3000 really is more effective than horse manure as the manufacturers claim. We can choose between the fertilisers, but we cannot do both. One solution is to invent a time machine: we could use it to experience both options, and it would be easy to decide in which conditions the strawberry plant grew best – simple.

The fundamental problem of causal inference is that we cannot directly experience the consequences of different options. Therefore, we need to find ways to estimate what would happen if we had chosen the other option, known as the counterfactual.

Estimating the counterfactual by using fair tests

Fair tests are the one scientific idea that I can be almost certain that every pupil arriving in Y7 will know. The key idea is that if we are going to compare two things, then we need to try to isolate the influence of the thing that we are interested in and keep everything else the same. It needs to be a fair comparison.

So in the case of the optimal growing conditions for our strawberry plant, we would want to keep things like the amount of light and water that the plants experience constant.

In theory, we could grow our plant in horse manure, and then replant it in Super Fertiliser 3000. To ensure that this is a fair test, we could repeat this process in a random order and carefully make our measurements. This would be a within-participant design. However, within-participant designs are really fiddly to get right and are rarely used in education outside of tightly controlled lab conditions. One particular challenge is that plant growth, just like learning, is at least semi-permanent, which makes it tricky to measure outcomes.

Instead, we can use a between-participant design where we take a group of strawberry plants (or pupils, teachers, or schools) and expose them to different interventions. To make this a fair test, we need to ensure that the groups are comparable and would – without any intervention – achieve similar outcomes.

So how do we go about creating similar groups? One wonderfully impractical option, but beloved by many scientific fields, is to use identical twins. The astronaut Scott Kelly went to the International Space Station, while his twin brother remained on Earth. It was then possible to investigate the effect of space on Scott’s health by using his brother as an approximation of what would have happened if he had not gone to space. These kinds of studies are often very good estimates of the counterfactual, but remember they are still not as good as our ideal option of building a time machine and directly observing both conditions.

Twin studies have yielded a lot of insights, but they’re probably not the answer to most questions in education. What if instead we just try and find individuals that are really similar? We could try to ‘match’ the strawberry plants, or people, we want to study with other ones that are very similar. We could decide on some things that we think matter, and then ensure that the characteristics of our groups were the same. For instance, in education, we might think that it is important to balance on pupils’ prior attainment, the age of the children, and proportion of pupils with SEND.

If we created two groups like this, would they be comparable? Would it be a fair test?

Observable and unobservable differences

To understand the limitations of this matching approach, it is useful to think about observable and unobservable differences between the groups. Observable differences are things that we can observe and measure – like the age of the pupils – while unobservable differences are things that we cannot or did not measure.

The risk with creating ‘matched’ groups is that although we may balance on some key observable differences, there may be important unobservable differences between the groups. These unobservable differences between the groups could then influence the outcomes that we care about – in other words, it would not be a fair test.

Frankly, this all sounds a bit rubbish – how are we ever going to achieve total balance on all of the factors that might influence pupil outcomes? Even if we knew what all the factors were, it would be a nightmare to measure them.

The benefit of randomly assigning units to groups is that we can forget about observed and unobserved differences, since random allocation means that, on average, they cancel each other out. A single RCT may favour one group over another by chance, but across many trials these differences will not systematically favour one group, hence the term unbiased causal inference.

We wanted correct causal inference, but unfortunately have to settle for unbiased causal inference – this is important to remember when interpreting findings from trials. This is a key reason why all trials need publishing, why we need more replication, and why we need to synthesise findings systematically.
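A quick simulation can make the difference between ‘correct’ and ‘unbiased’ tangible. In this Python sketch, each unit has a hidden trait that we never measure (the trait, sample size and number of repeats are all arbitrary assumptions). We randomly split the units into two groups many times and look at how unbalanced the groups are on that trait.

```python
import random
import statistics


def imbalance_after_random_allocation(hidden_trait, rng):
    """Randomly split units in half; return the gap in the hidden trait
    between the two groups (treatment mean minus control mean)."""
    units = list(hidden_trait)
    rng.shuffle(units)
    half = len(units) // 2
    treatment, control = units[:half], units[half:]
    return statistics.mean(treatment) - statistics.mean(control)


rng = random.Random(42)
# Hypothetical unobserved trait, e.g. soil quality for 40 strawberry plants.
hidden = [rng.gauss(0, 1) for _ in range(40)]
# Repeat the random allocation many times, as if running many trials.
diffs = [imbalance_after_random_allocation(hidden, rng) for _ in range(2000)]

print(f"largest imbalance in a single allocation: {max(abs(d) for d in diffs):.2f}")
print(f"average imbalance across allocations: {statistics.mean(diffs):.3f}")
```

Any single allocation can be noticeably lopsided, but the average gap across many allocations sits close to zero, which is the sense in which random allocation is unbiased rather than guaranteed to be right every time.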

Some misconceptions

Random allocation is a topic that seems quite simple at first – it is ultimately about creating comparable groups – but once you dig deeper, it has a lot of complexity to it. I follow a few medical statisticians on Twitter who routinely complain about clinicians’ and other researchers’ misunderstandings of random allocation.

Here’s a quick rundown of three widespread misunderstandings.

Random sampling and random allocation are different. They are designed to achieve different things. Random allocation to treatment groups is designed to enable unbiased causal inference. In short, it is about internal validity. Random sampling from a population, by contrast, is intended to achieve a representative sample, which in turn makes it easier to generalise to the population, so it is more about external validity.

Even weak RCTs are better than other studies. Random allocation is great, but it is not magic, and weak RCTs can lead us to make overconfident, incorrect inferences. Anything that threatens the fair test principle of an RCT is an issue. One widely noted issue is attrition, whereby some participants withdraw from a study, which can effectively undo the randomisation. This briefing note from the IES is very helpful.

RCTs need to achieve balance on every observable. They don’t. This is a bit of a nightmare of a topic, with a lot of disagreement on how this issue should be tackled. If you want to jump into the deep end on the limitations of trials, then a good starting point is this paper from Angus Deaton and Nancy Cartwright.
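
To make the attrition point from the second misconception concrete, here is a minimal Python sketch. Everything in it is hypothetical: the programme is assumed to have no effect at all, and I have assumed that pupils in the programme group who score below 40 withdraw before their results are recorded. Real attrition is rarely this tidy, so treat it as an illustration of the mechanism rather than a model of any actual study.

```python
import random
import statistics


def estimate_effect(rng, n_pupils=2000, attrition=False):
    """Simulate one trial of a programme with NO real effect.

    Returns the treated-group mean minus the control-group mean.
    """
    treated_scores, control_scores = [], []
    for _ in range(n_pupils):
        score = rng.gauss(50, 10)  # final test score, unaffected by the programme
        if rng.random() < 0.5:  # random allocation by coin flip
            # Hypothetical differential attrition: lower-scoring pupils in the
            # programme group withdraw, so their scores are never recorded.
            if attrition and score < 40:
                continue
            treated_scores.append(score)
        else:
            control_scores.append(score)
    return statistics.mean(treated_scores) - statistics.mean(control_scores)


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Average the estimate over 200 simulated trials for each scenario.
complete = statistics.mean(estimate_effect(rng) for _ in range(200))
with_attrition = statistics.mean(
    estimate_effect(rng, attrition=True) for _ in range(200)
)

print(f"Average estimate, everyone stays:       {complete:.2f}")
print(f"Average estimate, with uneven attrition: {with_attrition:.2f}")
```

With everyone retained, the average estimate sits near zero, as it should. Once the lower-scoring pupils leave only one group, the programme appears to work even though it does nothing, and running more trials does not fix this because the error is now systematic.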

Evidence use

What’s the role of phonics in secondary school?

Writing for Tes, I argue that we need to think carefully about how we use phonics in secondary.

The emphasis on phonics in English primary schools has increased dramatically since 2010, which makes the existing evidence less relevant to pupils who didn’t respond well to phonics teaching.

Even in recent research, older, struggling readers were often receiving systematic phonics teaching for the first time, particularly in studies outside of England. At best, these findings overstate the impact that we might expect.

I think there are some specific points about phonics that are interesting here, but I also think this highlights some wider principles about evidence-informed practice.

I think of this as being akin to the half-life of research. I was first introduced to this idea years ago, based on an interpretation of the evidence about peer tutoring offered by Robbie Coleman and Peter Henderson.

Evidence use

Do you really do it already?

Writing for the Research Schools Network, I challenge the notion that ‘we do it already’.

As we continue delivering the Early Career Framework (ECF) programme, we continue listening to our partner schools as well as the national debate. We are using this to both refine our work with schools, and to inform our engagement with national organisations.

One theme we have identified where we think we can clarify – and indeed even challenge schools – is the perception that ‘we already do this’, ‘we already know this’, or ‘we learnt this during initial teacher training’.

As a science teacher, one issue I regularly encounter is that pupils recall doing practicals. They may have completed them at primary school, or perhaps earlier in their time at secondary school. One thing I point out is the difference between being familiar with something – like ‘we remember doing this’ – and understanding something deeply.

My impression is that a similar phenomenon occurs with the ECF, which is not surprising given the overlap in content with the Core Content Framework that underpins ITT programmes. Indeed, I would argue that a similar phenomenon occurs with most teacher development. As teachers, we are often interested in the next big thing, but as Dylan Wiliam has argued, perhaps we should instead focus on doing the last big thing properly.

One way that I have previously tried to illustrate this point is with assessment for learning. These ideas were scaled up through various government initiatives from the late 1990s onwards, so if you taught during this time, it is unlikely that you did not receive some professional development on it.

Given this, I sometimes cheekily ask teachers to define it. I might give them a blank sheet of paper and some time to discuss it. There are always some great explanations, but it is also fair to say that this normally challenges teachers. Typically, the features that teachers first highlight are the more superficial aspects, such as asking lots of questions or using a random name generator.

No doubt given longer, we could reach better descriptions, but even experienced teachers can struggle to describe the deep structures of formative assessment, which Wiliam has described as his five principles. I have tried to summarise this by describing different levels of quality shown below. We could argue about what goes into each box – indeed, I think this would be an excellent conversation – but it hopefully illustrates that it is possible to use assessment for learning with different levels of quality.

In addition to these principles, there is also a deep craft to formative assessment. Arguably, this is what separates good from great formative assessment. Thus, great formative assessment does not involve totally different things; it is the sophistication and nuance that matter.

The need for repetition and deep engagement with ideas is not just my opinion; it is a key tenet of the EEF’s professional development guidance report. Further, the EEF evaluated a two-year programme focusing entirely on Embedding Formative Assessment, which led to improvements in GCSE outcomes, which are notoriously difficult to improve in research studies.

This project involved teachers meeting monthly to spend 90 minutes discussing key aspects of formative assessment. Unsurprisingly, some of these teachers too reported that they were familiar with the approach, yet the evidence is clear that this approach to improving teaching was effective.

Finally, there is some truth to the issues raised about repetition, and I think that ECTs and Mentors are right to protest that they have encountered some activities before. This is probably not very helpful. However, there is a big difference between just repeating activities and examining a topic in more depth. The DfE have committed to reviewing all programmes ahead of the next cohort in September, and I hope this distinction is recognised.

Level of quality: Level 1: superficial. Colleagues are aware of some superficial aspects of the approach, and may have some deeper knowledge, but this is not yet reflected in their practice.
Assessment for learning example: “Formative assessment is about copying learning objectives into books, and teachers asking lots of questions. Lollypop sticks and random name generators can be useful.”

Level of quality: Level 2.
Assessment for learning example: “Formative assessment is about regularly checking pupils’ understanding, and then using this information to inform current and future lessons. Effective questioning is a key formative assessment strategy.”

Level of quality: Level 3: developing. Colleagues have a clear understanding of the key dos and don’ts, which helps them to avoid common pitfalls. However, at this level, it is still possible for these ideas to be taken as a procedural matter. Further, although the approach may be increasingly effective, it is not yet as efficient as it might be. Teachers may struggle to be purposeful and use the approaches judiciously as they may not have a secure enough understanding – this may be particularly noticeable with how the approach may need tailoring to different phases, subjects, or contexts.
Assessment for learning example: “Formative assessment is about clearly defining and sharing learning intentions, then carefully eliciting evidence of learning and using this information to adapt teaching. Good things to do include eliciting evidence of learning in a way that allows it to be gathered and interpreted efficiently and effectively; using evidence of learning to decide how much time to spend on activities and when to remove scaffolding. Things to avoid include making inferences about all pupils’ learning using a non-representative sample, such as pupils who volunteer answers; mistaking pupils’ current level of performance for learning.”

Level of quality: Level 4: sophisticated. Colleagues have a secure understanding of the mechanisms that lead to improvement and how active ingredients can protect those mechanisms. This allows teachers to purposefully tailor approaches to their context without compromising fidelity to the core ideas. Further, at this level there is an increasing understanding of the conditional nature of most teaching and that there is seldom a single right way of doing things. Teaching typically involves lots of micro decisions, ‘if x then y’. There is also a growing awareness of trade-offs and diminishing returns to any activity. At this level, there is close attention to how changes in teaching lead to changes in pupil behaviours, which in turn influence learning.
Assessment for learning example: “I have a strong understanding of Wiliam’s five formative assessment strategies. Formative assessment allows me to become more effective and efficient in three main ways:
1. Focusing on subject matter that pupils are struggling with.
2. Optimising the level of challenge – including the level of scaffolding – and allowing teachers to move on at an appropriate pace, or to increase levels of support.
3. Developing more independent and self-aware learners who have a clearer understanding of what success looks like, which they can use to inform their actions in class as well as any homework and revision.”