Evidence generation

The DfE’s Evaluation Strategy

This week the DfE published an evaluation strategy. To my knowledge, this is the first time the department has published one. I think they should be applauded for taking evaluation increasingly seriously, but I want to offer some unsolicited feedback.

Demonstrating impact

The webpage hosting the report describes ‘how the Department for Education will work towards using robust evaluation practices to demonstrate impact and build our evidence base’.

‘Demonstrating impact’ presupposes that there is an impact, yet the consistent finding from evaluations is that programmes that genuinely make an impact are rare. The sociologist Peter Rossi described this as the iron law of evaluation: ‘the expected value of any net impact assessment of any large scale social program is zero’.

The reason ‘demonstrating impact’ concerns me is that it exposes a genuine misunderstanding about the purpose of evaluation. Evaluation can be highly technical, but understanding the purposes of evaluation should be possible for everyone.

The purposes of evaluation

I think evaluation has two purposes for a government organisation. First, there is an aspect of accountability. All politicians make claims about what they will do and evaluation is an important means of holding them to account and improving the quality of public debate.

I think of this like we are letting politicians drive the car. They get to decide where they go and how they drive, but we need some objective measures of their success – did they take us where they said they would? How much fuel did they use? Did they scratch the car? Evaluation can help us decide if we want to let politicians drive the car again.

The second purpose of evaluation is to build useable knowledge for the future. I am much more interested in this purpose. Continuing our analogy, this includes things like learning where the potholes are and what the traffic conditions are like so that we can make better choices in the future.

It is not possible to evaluate everything, so the DfE need to prioritise. The strategy explains that it will prioritise activities that are:

  1. High cost
  2. Novel, or not based on a strong evidence-base
  3. High risk

I completely understand why these areas were chosen, but I think these criteria are really dumb. I think they will misallocate evaluation effort and fail to serve the purposes of democratic accountability or building useable knowledge.

If we wanted to prioritise democratic accountability, we would focus evaluations on areas where governments had made big claims. Manifestos and high-profile policies would likely be our starting point.

If we wanted to build useable knowledge, we might instead focus on criteria like:

  1. Is it a topic where evidence would change opinions?
  2. What is the scale of the policy?
  3. How likely is the policy to be repeated?
  4. How good an evaluation are we likely to be able to achieve?
  5. Is there genuine uncertainty about aspects of the programme?

Leading an evaluative culture

The foreword by Permanent Secretary Susan Acland-Hood is encouraging as it suggests a greater emphasis on evaluation. It also makes the non-specific, but encouraging, commitment that ‘this document is a statement of intent by myself and my leadership team to take an active role in reinforcing our culture of robust evaluation. To achieve our ambition, we will commit to work closely with our partners and stakeholders’.

The foreword also notes the need for a more evaluative culture across Whitehall. I think a starting point for this is to clarify the purposes of evaluation – and ‘demonstrating impact’ is not the correct answer.

A gentle way of creating an evaluative culture might be to increasingly introduce evaluations within programmes, like the recently published ‘nimble evaluations’ as part of the National Tutoring Programme. These approaches help to optimise programmes without the existential threat of finding they did not work.

Another way of creating an evaluation culture would be to focus more on the research questions we want to answer and offer incentives for genuinely answering them.

Evidence generation

The logic of randomisation

Randomised controlled trials (RCTs) are an important tool for evidence-informed practice. At times, I think that trials are both brilliant and simple; at other times, I find them frustrating, complicated, and even useless. Quality professional dialogue about evidence in education requires a firm grasp of the principles of trials, so I’m going to attempt to explain them.

In education, we have lots of different questions. We can classify these into descriptive, associative, and causal questions. Randomised controlled trials are suited to answering this final type of question. These kinds of questions generally boil down to ‘what works’. For instance, does this professional development programme improve pupil outcomes?

The fundamental problem of causal inference

To answer causal questions, we come up against the fundamental problem of causal inference: for every decision we make, we can only experience what happens if we do A or what happens if we do B – we cannot experience both and compare them directly.

Suppose we have a single strawberry plant and we want to know if Super Fertiliser 3000 really is more effective than horse manure as the manufacturers claim. We can choose between the fertilisers, but we cannot do both. One solution is to invent a time machine: we could use it to experience both options, and it would be easy to decide in which conditions the strawberry plant grew best – simple.

The fundamental problem of causal inference is that we cannot directly experience the consequences of different options. Therefore, we need to find ways to estimate what would happen if we had chosen the other option, known as the counterfactual.

Estimating the counterfactual by using fair tests

Fair tests are the one scientific idea that I can be almost certain every pupil arriving in Y7 will know. The key idea is that if we are going to compare two things, then we need to isolate the influence of the thing that we are interested in and keep everything else the same. It needs to be a fair comparison.

So in the case of the optimal growing conditions for our strawberry plant, we would want to keep things like the amount of light and water that the plants experience constant.

In theory, we could grow our plant in horse manure, and then replant it in Super Fertiliser 3000. To ensure that this is a fair test, we could repeat this process in a random order and carefully make our measurements. This would be a within-participant design. However, within-participant designs are really fiddly to get right and are rarely used in education outside of tightly controlled lab conditions. One particular challenge is that plant growth, just like learning, is at least semi-permanent, so the effects of one condition carry over and muddy our measurements of the next.

Instead, we can use a between-participant design where we take a group of strawberry plants (or pupils, teachers, or schools) and expose them to different interventions. To make this a fair test, we need to ensure that the groups are comparable and would – without any intervention – achieve similar outcomes.

So how do we go about creating similar groups? One wonderfully impractical option, but beloved by many scientific fields, is to use identical twins. The astronaut Scott Kelly went to the International Space Station, while his twin brother remained on Earth. It was then possible to investigate the effect of space on Scott’s health by using his brother as an approximation of what would have happened if he had not gone to space. These kinds of studies are often very good estimates of the counterfactual, but remember they are still not as good as our ideal option of building a time machine and directly observing both conditions.

Twin studies have yielded a lot of insights, but they’re probably not the answer to most questions in education. What if instead we just try and find individuals that are really similar? We could try to ‘match’ the strawberry plants, or people, we want to study with other ones that are very similar. We could decide on some things that we think matter, and then ensure that the characteristics of our groups were the same. For instance, in education, we might think that it is important to balance on pupils’ prior attainment, the age of the children, and proportion of pupils with SEND.

If we created two groups like this, would they be comparable? Would it be a fair test?

Observable and unobservable differences

To understand the limitations of this matching approach, it is useful to think about observable and unobservable differences between the groups. Observable differences are things that we can observe and measure – like the age of the pupils – while unobservable differences are things that we cannot or did not measure.

The risk with creating ‘matched’ groups is that although we may balance on some key observable differences, there may be important unobservable differences between the groups. These unobservable differences between the groups could then influence the outcomes that we care about – in other words, it would not be a fair test.

Frankly, this all sounds a bit rubbish – how are we ever going to achieve total balance on all of the factors that might influence pupil outcomes? Even if we knew what all the factors were, it would be a nightmare to measure them.

The benefit of randomly assigning units to groups is that we do not need to worry about observed and unobserved differences: random allocation means that, in expectation, they balance out across the groups. A single RCT may still favour one group by chance, but over many trials these differences will not systematically favour either group – hence the term unbiased causal inference.

We wanted correct causal inference, but unfortunately have to settle for unbiased causal inference – this is important to remember when interpreting findings from trials. This is a key reason why all trials need publishing, why we need more replication, and why we need to synthesise findings systematically.
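A small simulation can make unbiasedness concrete. The Python sketch below is illustrative only – the effect size, the ‘aptitude’ variable, and every number in it are my own invented assumptions, not data from any real trial. Each pupil has an unobserved aptitude that the researcher never measures; allocation is a coin flip. A single trial’s estimate wobbles around the true effect, but the average across many repeated trials homes in on it:

```python
import random

random.seed(42)

TRUE_EFFECT = 5.0  # the intervention genuinely adds 5 points (invented)

def run_trial(n=100):
    """One RCT: pupils have an unobserved aptitude; allocation is a coin flip."""
    treat, ctrl = [], []
    for _ in range(n):
        aptitude = random.gauss(50, 10)   # unobserved difference between pupils
        if random.random() < 0.5:         # random allocation
            treat.append(aptitude + TRUE_EFFECT)
        else:
            ctrl.append(aptitude)
    # Estimated effect: difference between the group means
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

# A single trial's estimate may be some way off the true effect...
single_estimate = run_trial()

# ...but averaging the estimates from many repeated trials recovers it.
# That is what 'unbiased' means.
mean_estimate = sum(run_trial() for _ in range(2000)) / 2000
```

Note that the unobserved aptitude never has to be measured, matched, or even named for the estimate to be unbiased – that is the advantage over matching on observables.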

Some misconceptions

Random allocation is a topic that seems quite simple at first – it is ultimately about creating comparable groups – but once you dig deeper, it has a lot of complexity to it. I follow a few medical statisticians on Twitter who routinely complain about clinicians’ and other researchers’ misunderstandings of random allocation.

Here’s a quick rundown of three widespread misunderstandings.

Random sampling and random allocation are different. They are designed to achieve different things. Random allocation to treatment groups is designed to enable unbiased causal inference – in short, it is about internal validity. Random sampling from a population, by contrast, is intended to achieve a representative sample, which makes it easier to generalise findings to that population – so random sampling is more about external validity.
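The difference is easy to see in code. In this Python sketch (the school names are placeholders I have made up), random sampling draws a representative subset from a population, while random allocation splits whoever is in the study into comparable arms:

```python
import random

random.seed(1)

population = [f"school_{i}" for i in range(1000)]

# Random SAMPLING: draw a representative subset of the population.
# This is about external validity - who the findings generalise to.
sample = random.sample(population, 20)

# Random ALLOCATION: split the recruited schools into comparable groups.
# This is about internal validity - whether the comparison is fair.
shuffled = list(sample)
random.shuffle(shuffled)
intervention, control = shuffled[:10], shuffled[10:]
```

A trial can have one without the other: an RCT run on a convenience sample of volunteer schools has random allocation but not random sampling, so its internal validity can be strong while its generalisability is weak.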

Even weak RCTs are better than other studies. Random allocation is great, but it is not magic, and weak RCTs can lead us to make overconfident, incorrect inferences. Anything that threatens the fair test principle of an RCT is an issue. One widely noted issue is attrition whereby some participants withdraw from a study, which can effectively undo the randomisation. This briefing note from the IES is very helpful.
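To see how attrition can undo randomisation, here is a deliberately extreme Python sketch – all numbers are invented. The intervention has a true effect of zero, but struggling pupils in the intervention arm drop out before the final test, so a complete-case comparison manufactures an apparent benefit:

```python
import random

random.seed(7)

TRUE_EFFECT = 0.0  # in this scenario the intervention does nothing

def trial_with_attrition(n=2000):
    """RCT with differential attrition in the intervention arm."""
    treat, ctrl = [], []
    for _ in range(n):
        score = random.gauss(100, 15)
        if random.random() < 0.5:
            # Struggling pupils find the intervention burdensome and drop
            # out, so their scores never reach the final analysis.
            if score > 90:
                treat.append(score + TRUE_EFFECT)
        else:
            ctrl.append(score)
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

# Complete-case analysis shows a spurious several-point 'effect',
# even though the true effect is zero.
apparent_effect = trial_with_attrition()
```

The groups were comparable at randomisation; it is the post-randomisation dropout, related to the outcome, that breaks the fair test.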

RCTs need to achieve balance on every observable. They don’t. This is a bit of a nightmare of a topic, with a lot of disagreement on how this issue should be tackled. If you want to jump into the deep end on the limitations of trials, then a good starting point is this paper from Angus Deaton and Nancy Cartwright.

Evidence generation

Research ethics need a new responsibility to teachers

Writing for Schools Week, I argue that researchers need a clearer responsibility to research users.

At present, most ethical considerations focus on participants, not the far greater number of research users. This has a range of negative consequences.

A new ethical responsibility – reinforced by tasty carrots and pointy sticks – is needed.

Evidence generation

Questions first, methods second

Research tries to answer questions. The range of education research questions is vast: why do some pupils truant? What is the best way to teach fractions? Which pupils are most likely to fall behind at school? Is there a link between the A-levels pupils study and their later earnings in life?

Despite the bewildering array of questions, education research questions can be put into three main groups.

  1. Description. Aims to find out what is happening, like how many teachers are there in England? What is the average KS2 SAT score in Sunderland?
  2. Association. Aims to find patterns between two or more things, like do pupils eligible for free school meals do worse at GCSE than their more affluent peers?
  3. Causation. Aims to answer if one thing causes another, like does investing in one-to-one tuition improve GCSE history outcomes?

The research question determines the method

A really boring argument is about which type of research is best. Historically, education has been plagued by debates about the merits of qualitative versus quantitative research.

A useful mantra is questions first, methods second. Quite simply, some methods are better suited to answering some questions than others. A good attempt to communicate this comes from the Alliance for Useful Evidence’s report, ‘What Counts As Good Evidence?’

Have a go at classifying these questions into the three categories of description, association, or causation.

  1. How many teachers join the profession each year in England?
  2. What percentage of children have no breakfast?
  3. How well do children who have no breakfast do in their SATS?
  4. Does running a breakfast club improve pupils’ SATS scores?
  5. How prevalent is bullying in England’s schools?
  6. Are anti-bullying interventions effective at stopping bullying?
  7. Does reading to dogs improve pupils’ reading?
  8. Is it feasible to have a snake as a class pet?
  9. Is there a link between school attendance and pupil wellbeing?
  10. Does marking work more often improve science results?

Answers: 1) descriptive 2) descriptive 3) associative 4) causal 5) descriptive 6) causal 7) causal 8) descriptive 9) associative 10) causal

Finally, if you want a fantastic guide to research questions, then Patrick White’s Developing Research Questions is an excellent read.

Evidence generation

Unmasking education trials

Recent weeks have seen a series of exciting announcements about the results of randomised controlled trials testing the efficacy of vaccines. Beyond the promising headlines, interviews with excited researchers have featured the phrase ‘unmasking’. But what is unmasking, and is it relevant to trials in education?

Unmasking is the stage in a trial when researchers find out whether each participant is in the control or intervention group. In healthcare, there are up to three ways that trials can be masked. First, participants may be unaware whether they are receiving the intervention; second, practitioners leading the intervention, like nurses providing a vaccination, may not know which participants are receiving the intervention; third, the researchers leading the trial and analysing the data may not know which treatment each participant receives.

Each of these masks, also known as blinding, is designed to prevent known biases. If knowledge of treatment allocation changes the behaviour of stakeholders – participants, practitioners, researchers – this may be misattributed to the intervention. For instance, in a trial testing vaccinations, participants who know that they have received the real vaccine may become more reckless, which could increase their risk of infection; practitioners may provide better care to participants they know are not getting the vaccine; researchers may make choices – consciously or sub-consciously – that favour their preferred outcomes.

These various risks are called social interaction threats, and each has various names. Learning the names is interesting, but I find it helpful to focus on their commonalities: they all stem from actors in the research changing their behaviour based on treatment allocation. The risk is that these can lead to apparent effects that are misattributed to the intervention.

  • Diffusion or imitation of treatment is when the control group starts imitating – or at least attempting to imitate – the intervention.
  • Compensatory rivalry is when the control group puts in additional effort to ‘make up’ for not receiving the intervention.
  • Resentful demoralisation is the opposite of compensatory rivalry: the control group become demoralised after finding out they will miss out on the intervention.
  • Compensatory equalisation of treatment is when practitioners act favourably towards participants they perceive to be getting the less effective intervention.

So what does this all have to do with education?

It is easy to imagine how each threat could become a reality in an education trial. So does it matter that masking is extremely rare in education? Looking through trials funded by the Education Endowment Foundation, it is hard to find any that mention blinding. Further, there is limited mention in the EEF’s guidance for evaluators.

It would undoubtedly help if trials in education could be masked, but there are two main obstacles. First, there are practical barriers to masking – is it possible for a teacher to deliver a new intervention without knowing they are delivering it? Second, it could be argued that in the long list of things that need improving about trials in education, masking is pretty low down the list.

Although it is seldom possible to have complete masking in education, there are practical steps that can be taken. For instance:

  • ensuring that pre-testing happens prior to treatment allocation
  • ensuring that the marking, and ideally invigilation, of assessments is undertaken blind to treatment allocation
  • incorporating aspects of ‘mundane realism’ to minimise the threats of compensatory behaviours
  • analysing results blind to treatment allocation, and ideally guided by a pre-specified plan; some trials even have an independent statistician lead the analysis
  • actively monitoring the risk of each of these biases
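Blind analysis in particular is cheap to arrange. In the Python sketch below – with invented pupil IDs and scores – a trial manager recodes the groups as ‘A’ and ‘B’ and keeps the key back, the analyst computes the headline comparison without knowing which is which, and the key is only revealed once the analysis is fixed:

```python
import random

random.seed(3)

# Hypothetical trial records: (pupil_id, true_group, score).
records = [
    ("p1", "intervention", 68), ("p2", "control", 61),
    ("p3", "intervention", 72), ("p4", "control", 65),
]

# Step 1 (trial manager): replace group names with neutral codes and keep
# the key to one side, so the analyst works blind to treatment allocation.
groups = sorted({g for _, g, _ in records})
codes = ["A", "B"]
random.shuffle(codes)
key = dict(zip(groups, codes))
blinded = [(pid, key[g], score) for pid, g, score in records]

# Step 2 (analyst): compare the groups knowing only the codes 'A' and 'B'.
def group_mean(code):
    scores = [s for _, c, s in blinded if c == code]
    return sum(scores) / len(scores)

diff_a_minus_b = group_mean("A") - group_mean("B")

# Step 3: only after the analysis is locked is the key revealed (unmasking).
unmask = {code: group for group, code in key.items()}
```

Even in a tiny trial like this, fixing the analysis before unmasking removes the temptation to make choices – consciously or sub-consciously – that favour a preferred result.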

I do not think we should give up all hope of masking in education. In surgery, so-called ‘sham’ operations are sometimes undertaken to prevent patients from knowing which treatment they have received. These involve little more than making an incision and then stitching it back up. It is possible to imagine adapting this approach in education.

We should also think carefully about masking on a case-by-case basis as some trials are likely at greater risk of social threats to validity than others. For instance, trials where control and intervention participants are based in the same school, or network of schools, are likely at the greatest risk of these threats.

In conclusion, a lack of masking is not a fatal blow to trials in education. We should also avoid thinking of masking as an all-or-nothing event. As Torgerson and Torgerson argue, there are different ways that masking can be undertaken. Taking a pragmatic approach where we (1) mask where possible, (2) consider the risks inherent in each trial and (3) closely monitor for threats when we cannot mask is probably a good enough solution. At least for now.

Evidence generation

A manifesto for better education products

It seems a lifetime ago that the BBC was pressured to remove its free content for schools by organisations intent on profiting from the increased demand for online learning materials.

As schools and a sense of normality return, however, the likelihood is that schools will continue to need to work digitally – albeit more sporadically. It is therefore crucial that we fix our systems now to avoid a messy second wave of the distance learning free-for-all.

Here’s my fantasy about how we could do things differently so that the interests of everyone involved are better aligned. Let’s be clear, there are real challenges, but change is possible.

To make my point, I want to start with the low-hanging fruit of Computer Assisted Instruction – think Seneca or Teach Your Monster to Read.

Computer Assisted Instruction is widely used. There is an abundance of platforms – each with their strengths – but none of them ever quite satisfy me and using multiple platforms is impractical as pupils spend more time logging in than learning.

Using three guiding principles, I think we could have a better system.

Principle 1: fund development, not delivery

Who do you think works for organisations offering Computer Assisted Instruction? That’s right, salespeople.

Curriculum development is slow, technical work so developers face high initial costs. As they only make money once they have a product, the race is on to create one and then flog it – hence the salespeople.

The cost of one additional school using the programme is negligible. Therefore, we could properly fund the slow development and then make the programmes free at the point of use.

Principle 2: open source materials

Here’s a secret: if you look at curriculum mapped resources and drill down to the detail, it’s often uninspiring because – as we have already seen – rational developers create a minimum viable product before hiring an ace sales team.

Our second principle is that anything developed has to be made open source – in a common format, freely available – so that good features can be adopted and improved.

This approach could harness the professional generosity and expertise that currently exists in online subject communities, like CogSciSci and Team English.

Principle 3: try before we buy

Most things do not work as well as their producers claim – shocking, I know. When the EEF tests programmes only around a quarter are better than what schools already do.

Our third principle is that once new features, like a question set, have been developed, they have to be tested and shown to be better than what already exists before they are rolled out.

By designing computer assisted instruction systems on a common framework – and potentially linking it to other data sources – we can check to see if the feature works as intended.

A worked example

Bringing this all together, we end up with a system that better focuses everyone’s efforts on improving learning, while also likely delivering financial savings.

It starts with someone having a new idea about how to improve learning – perhaps a new way of teaching about cells in Year 7 biology. They receive some initial funding to develop the resource to its fullest potential. The materials are developed using a common format and made freely available for other prospective developers to build on.

Before the new materials are released to students, they are tested using a randomised controlled trial. Only the features that actually work are then released to all students. Over time, the programme gets better and better.

The specifics of this approach can be adjusted to taste. For instance, we could pay by results; ministers could run funding rounds linked to their priorities, like early reading; we could still allow choice between a range of high quality programmes.

Further, the ‘features’ need not be restricted to question sets; a feature could be a new algorithm that optimises the spacing and interleaving of questions; a dashboard that provides useful insights for teachers to use; a system to motivate pupils to study for longer.

This approach allows the best elements of different programmes to be combined, rather than locking schools into a single product ecosystem.

I think we can do better than the current system – but are we willing to think differently?