Evidence generation

The logic of randomisation

Randomised controlled trials (RCTs) are an important tool for evidence-informed practice. At times, I think that trials are both brilliant and simple; at other times, I find them frustrating, complicated, and even useless. Quality professional dialogue about evidence in education requires a firm grasp of the principles of trials, so I’m going to attempt to explain them.

In education, we have lots of different questions. We can classify these into descriptive, associative, and causal questions. Randomised controlled trials are suited to answering this final type of question. These kinds of questions generally boil down to ‘what works’. For instance, does this professional development programme improve pupil outcomes?

The fundamental problem of causal inference

To answer causal questions, we come up against the fundamental problem of causal inference: for every decision we make, we can only experience what happens if we do A or what happens if we do B. We cannot experience both and compare them directly.

Suppose we have a single strawberry plant and we want to know if Super Fertiliser 3000 really is more effective than horse manure as the manufacturers claim. We can choose between the fertilisers, but we cannot do both. One solution is to invent a time machine: we could use it to experience both options, and it would be easy to decide in which conditions the strawberry plant grew best – simple.

The fundamental problem of causal inference is that we cannot directly experience the consequences of different options. Therefore, we need to find ways to estimate what would happen if we had chosen the other option, known as the counterfactual.
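One way to make the counterfactual concrete is a small simulation. The sketch below is purely illustrative: the yields, the size of the effect, and the function names are all made up for the example. Every simulated plant has two potential outcomes, but once we choose a fertiliser we only ever get to see one of them.

```python
import random


def simulate_plants(n_plants, true_effect, seed=0):
    """Return (yield_with_manure, yield_with_fertiliser) for each plant.

    All numbers are invented for illustration. In real life we never
    get to see both values for the same plant.
    """
    rng = random.Random(seed)
    plants = []
    for _ in range(n_plants):
        manure_yield = rng.gauss(100, 15)
        plants.append((manure_yield, manure_yield + true_effect))
    return plants


def observe(plants, choices):
    """Reveal only the outcome for the option actually chosen.

    The outcome for the option not chosen is the counterfactual,
    and it stays hidden.
    """
    return [pair[choice] for pair, choice in zip(plants, choices)]


plants = simulate_plants(n_plants=5, true_effect=10)
choices = [0, 1, 0, 1, 1]  # 0 = horse manure, 1 = Super Fertiliser 3000
observed = observe(plants, choices)
```

Only inside a simulation can we peek at both columns of `plants`. With real strawberry plants, `observed` is all we would ever have.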

Estimating the counterfactual by using fair tests

Fair tests are the one scientific idea that I can be almost certain that every pupil arriving in Y7 will know. The key idea is that if we are going to compare two things, then we need to try to isolate the influence of the thing that we are interested in and keep everything else the same. It needs to be a fair comparison.

So in the case of the optimal growing conditions for our strawberry plant, we would want to keep things like the amount of light and water that the plants experience constant.

In theory, we could grow our plant in horse manure, and then replant it in Super Fertiliser 3000. To ensure that this is a fair test, we could repeat this process in a random order and carefully make our measurements. This would be a within-participant design. However, within-participant designs are really fiddly to get right and are rarely used in education outside of tightly controlled lab conditions. One particular challenge is that plant growth, just like learning, is at least semi-permanent: the effects of the first treatment carry over into the second, which makes it tricky to attribute outcomes to either one.

Instead, we can use a between-participant design where we take a group of strawberry plants (or pupils, teachers, or schools) and expose them to different interventions. To make this a fair test, we need to ensure that the groups are comparable and would – without any intervention – achieve similar outcomes.

So how do we go about creating similar groups? One wonderfully impractical option, beloved by many scientific fields, is to use identical twins. The astronaut Scott Kelly went to the International Space Station, while his twin brother remained on Earth. It was then possible to investigate the effect of space on Scott’s health by using his brother as an approximation of what would have happened if he had not gone to space. These kinds of studies often give very good estimates of the counterfactual, but remember they are still not as good as our ideal option of building a time machine and directly observing both conditions.

Twin studies have yielded a lot of insights, but they’re probably not the answer to most questions in education. What if instead we just try to find individuals that are really similar? We could try to ‘match’ the strawberry plants, or people, we want to study with other ones that are very similar. We could decide on some things that we think matter, and then ensure that the characteristics of our groups were the same. For instance, in education, we might think that it is important to balance on pupils’ prior attainment, the age of the children, and the proportion of pupils with SEND.

If we created two groups like this, would they be comparable? Would it be a fair test?

Observable and unobservable differences

To understand the limitations of this matching approach, it is useful to think about observable and unobservable differences between the groups. Observable differences are things that we can observe and measure – like the age of the pupils – while unobservable differences are things that we cannot or did not measure.

The risk with creating ‘matched’ groups is that although we may balance on some key observable differences, there may be important unobservable differences between the groups. These unobservable differences between the groups could then influence the outcomes that we care about – in other words, it would not be a fair test.
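A toy simulation can show how this goes wrong. In the hedged sketch below, everything is an assumption made for illustration: pupils have an observed prior attainment band and an unobserved 'motivation' score, the more motivated pupils are the ones who end up receiving the programme, and the programme itself has no effect at all. We then compare pupils within the same prior attainment band, which is a simple form of matching.

```python
import random
from statistics import mean


def make_pupils(n, seed=1):
    """Create pupils with one observed and one unobserved characteristic."""
    rng = random.Random(seed)
    pupils = []
    for _ in range(n):
        pupils.append({
            "prior": rng.choice([1, 2, 3]),  # observed: prior attainment band
            "motivation": rng.random(),      # unobserved: never measured
        })
    return pupils


def outcome(pupil, treated, true_effect, rng):
    """Made-up outcome model: both characteristics matter."""
    return (10 * pupil["prior"] + 20 * pupil["motivation"]
            + true_effect * treated + rng.gauss(0, 1))


def matched_estimate(pupils, true_effect=0.0, seed=2):
    """Compare groups within each prior attainment band.

    Assumption: pupils with motivation above 0.5 take up the programme.
    The groups are balanced on prior attainment but not on motivation.
    """
    rng = random.Random(seed)
    differences = []
    for band in (1, 2, 3):
        in_band = [p for p in pupils if p["prior"] == band]
        treated = [outcome(p, 1, true_effect, rng)
                   for p in in_band if p["motivation"] > 0.5]
        control = [outcome(p, 0, true_effect, rng)
                   for p in in_band if p["motivation"] <= 0.5]
        differences.append(mean(treated) - mean(control))
    return mean(differences)


estimate = matched_estimate(make_pupils(3000), true_effect=0.0)
```

Even though the true effect is set to zero, `estimate` comes out clearly positive, because the unmeasured motivation differs between the matched groups. The matching did its job on what we could see and failed on what we could not.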

Frankly, this all sounds a bit rubbish – how are we ever going to achieve total balance on all of the factors that might influence pupil outcomes? Even if we knew what all the factors were, it would be a nightmare to measure them.

The benefit of randomly assigning units to groups is that we no longer need to identify and balance every observed and unobserved difference ourselves: random allocation means that no characteristic is systematically more likely to end up in one group than the other. In any single RCT, chance may still favour one group over another, but across many trials these differences will not systematically favour either group, hence the term unbiased causal inference.
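Here is a minimal sketch of that idea, using invented numbers. Each unit gets a single baseline score that stands in for everything about it, measured or not. We randomly allocate half the units to the intervention, estimate the effect as the difference in group means, and then repeat the whole trial many times.

```python
import random
from statistics import mean


def run_trial(n_units, true_effect, rng):
    """Run one simulated trial and return the estimated effect."""
    # The baseline bundles together all observed and unobserved
    # characteristics of a unit (an assumption of this toy model).
    baselines = [rng.gauss(50, 10) for _ in range(n_units)]

    # Random allocation: shuffle the unit ids and treat the first half.
    order = list(range(n_units))
    rng.shuffle(order)
    treated_ids = set(order[: n_units // 2])

    treated = [b + true_effect
               for i, b in enumerate(baselines) if i in treated_ids]
    control = [b for i, b in enumerate(baselines) if i not in treated_ids]
    return mean(treated) - mean(control)


def run_many(n_trials=2000, n_units=40, true_effect=5.0, seed=3):
    """Repeat the trial many times with fresh units each time."""
    rng = random.Random(seed)
    return [run_trial(n_units, true_effect, rng) for _ in range(n_trials)]


estimates = run_many()
average_estimate = mean(estimates)
spread = max(estimates) - min(estimates)
```

Individual values in `estimates` can land well above or below the true effect of 5, which is what `spread` captures, but `average_estimate` sits very close to it. That is the sense in which random allocation is unbiased without any single trial being guaranteed correct.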

We wanted correct causal inference, but unfortunately have to settle for unbiased causal inference – this is important to remember when interpreting findings from trials. This is a key reason why all trials need publishing, why we need more replication, and why we need to synthesise findings systematically.

Some misconceptions

Random allocation is a topic that seems quite simple at first – it is ultimately about creating comparable groups – but once you dig deeper, it has a lot of complexity to it. I follow a few medical statisticians on Twitter who routinely complain about clinicians’ and other researchers’ misunderstandings of random allocation.

Here’s a quick rundown of three widespread misunderstandings.

Random sampling and random allocation are different. They are designed to achieve different things. Random allocation to treatment groups is designed to enable unbiased causal inference. In short, it is about internal validity. Random sampling from a population, by contrast, is intended to achieve a representative sample, which in turn makes it easier to generalise to the population. Random sampling is therefore more about external validity.
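The two steps are easy to tell apart once they are written down. In this small sketch, the school names, the numbers, and the function names are all hypothetical: the first function decides who takes part, and the second decides which group each participant ends up in.

```python
import random


def random_sample(population, k, rng):
    """Random sampling: choose who takes part in the study."""
    return rng.sample(population, k)


def random_allocation(participants, rng):
    """Random allocation: split the participants into two groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(5)
population = [f"school_{i}" for i in range(500)]  # hypothetical schools

sample = random_sample(population, 40, rng)
intervention, control = random_allocation(sample, rng)
```

A study can do either step without the other. Many education trials recruit volunteer schools (no random sampling) and then allocate them randomly, which supports a fair comparison within the trial but says less about schools that did not volunteer.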

Even weak RCTs are better than other studies. Random allocation is great, but it is not magic, and weak RCTs can lead us to make overconfident, incorrect inferences. Anything that threatens the fair test principle of an RCT is an issue. One widely noted issue is attrition, whereby some participants withdraw from a study, which can effectively undo the randomisation. This briefing note from the IES is very helpful.
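To see how attrition can undo randomisation, here is a deliberately crude sketch. The assumptions are mine and chosen to make the point obvious: the intervention has no effect, and every unit in the intervention group with a baseline below 45 drops out, while nobody leaves the control group.

```python
import random
from statistics import mean


def trial_with_attrition(n_units=2000, true_effect=0.0, seed=4):
    """Return the estimated effect before and after selective dropout."""
    rng = random.Random(seed)
    baselines = [rng.gauss(50, 10) for _ in range(n_units)]

    # Random allocation: shuffle, then split down the middle.
    rng.shuffle(baselines)
    half = n_units // 2
    treated = baselines[:half]
    control = baselines[half:]

    # Estimate with everyone who was randomised.
    full = mean([b + true_effect for b in treated]) - mean(control)

    # Assumed dropout rule: low-baseline units leave the intervention group.
    stayed = [b for b in treated if b >= 45]
    after = mean([b + true_effect for b in stayed]) - mean(control)
    return full, after


full_estimate, after_attrition = trial_with_attrition()
```

With everyone included, `full_estimate` is close to the true effect of zero. After the selective dropout, `after_attrition` is clearly positive, even though nothing about the intervention has changed. Real attrition is rarely this tidy, but the mechanism is the same: the groups that remain are no longer the groups that were randomised.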

RCTs need to achieve balance on every observable. They don’t. This is a bit of a nightmare of a topic, with a lot of disagreement on how this issue should be tackled. If you want to jump into the deep end on the limitations of trials, then a good starting point is this paper from Angus Deaton and Nancy Cartwright.