The Use and Abuse of Statistics by the Department for Work and Pensions

Yesterday saw the Department for Work and Pensions rolling out their latest attempt to get the long-term unemployed back to work. The new scheme, called “Help to Work”, is supposed to provide extra support to those who have been out of work for more than two years. Esther McVey, speaking on the Today programme on the morning of the scheme’s start, said:
“This is significant extra support, it's about helping people into work . . . It is absolutely not about punishment . . . This is costing money. We are supporting people, we are getting people who know as much as they possibly can . . . helping people into work, that's what they do every day, help people into work and we're putting more support for more people to help them get into work and benefit from our long-term economic plan." 
[Source, starting around 1h 40] 

Unlike many government schemes, this one has been scientifically tested for efficacy. Ms McVey even used the study as evidence that the scheme will work:
“ . . . we saw with this extra support they were remaining off benefit but more importantly they were staying in work . . .”
[Source, starting around 1h 39]

But is this the case? Does the study back up these claims? 

Well, the answer is yes and no.

To explain what I mean I need a little digression into statistics, but first, the set-up. The study took just over 15,000 people on Jobseeker’s Allowance (JSA) who had been unemployed and on the Work Programme (the programme that Help to Work is replacing) for two years. It divided these people at random into three groups: a control group, which continued using the same services as before, and two test groups, one put on the Community Action Programme (CAP) and the other on Ongoing Case Management (OCM). The study then compared outcomes in the two test groups with the control group to see if there were any differences. This is a classic experimental design; everything looks sound so far.

The study began by tracking all three groups for 13 weeks before the interventions started, so that any pre-existing differences between the groups could be identified. The groups were then put on their different interventions and their progress was tracked in a number of ways: the number of days they were in employment or on benefits, the number of weeks they were employed or on benefits, and the number of benefit claims or spells of employment they had over the 91 weeks that they were tracked.

So, what did they find? They found this:

[Source, page 12]

Note that * means there was a statistically significant difference at the 95% confidence level and ** means there was a statistically significant difference at the 99% level (the meaning of which I'll explain in a minute).

This is where I need the digression. When you’re doing a scientific study you need to pre-determine what counts as a “win”. This is done so that you can’t pretend that your experiment worked when it really didn’t. Most studies test what is called a null hypothesis. This is the hypothesis that there is no difference between a control group and an experimental group. So in the case of the job centre study, the null hypothesis is that there is no difference between the control group and either the CAP or the OCM group. Statistical tests are then used to decide whether any observed difference is bigger than chance alone would explain. If it is, the difference is said to be statistically significant, meaning it is unlikely to have happened by chance.

The standard measure is the 95% confidence level. This means that if the null hypothesis were true, there would only be a 5% chance of producing a result like this. This may seem a weird way of phrasing it: surely we want to know if the null hypothesis is false? But it’s all to do with the cautiousness of statistics. Statisticians would rather accept a false null hypothesis (i.e. think there's no difference when there really is one) than reject a true one (i.e. think there is a difference when there really isn't one), because the consequences of the latter are often much worse. Likewise, the 99% confidence level means there’s only a 1% chance of getting a result like this if the null hypothesis were true. Pretty unlikely!
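To make that concrete, here is a minimal sketch in Python, using made-up numbers of my own rather than anything from the study, of what “a 5% chance under the null hypothesis” means. It pools two groups, re-splits them at random many times, and counts how often chance alone produces a difference at least as big as the one observed:

    import random

    random.seed(42)

    # Hypothetical illustration, not the DWP data: two groups of 5,000 people
    # where the "intervention" truly does nothing, so both groups have the
    # same underlying 30% chance of being in work.
    n = 5000
    control = [random.random() < 0.30 for _ in range(n)]
    treated = [random.random() < 0.30 for _ in range(n)]
    observed_diff = abs(sum(treated) - sum(control)) / n

    # Simulate the null hypothesis: pool everyone, re-split at random many
    # times, and count how often chance alone gives a difference at least
    # as big as the one observed.
    pooled = control + treated
    trials, extreme = 2000, 0
    for _ in range(trials):
        random.shuffle(pooled)
        diff = abs(sum(pooled[:n]) - sum(pooled[n:])) / n
        if diff >= observed_diff:
            extreme += 1

    p_value = extreme / trials
    print(f"observed difference: {observed_diff:.3%}")
    print(f"p-value under the null hypothesis: {p_value:.3f}")
    # The study's authors flag p < 0.05 with * and p < 0.01 with **.

Because there is no real difference between these two groups, the p-value will usually come out well above 0.05; it will dip below 0.05 only about one time in twenty, which is exactly the 5% chance described above.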

In the table above you can see * and ** all over the place. This means that these results are statistically significant, which must be good. We can conclude that there’s a real difference between the control and the experimental groups. So the new interventions work and the Department for Work and Pensions is onto a sure-fire way to reduce unemployment among the long-term unemployed. Happy days are here again!

Not so fast. . . This is where we get to tackle the “no” part of my “yes and no” answer, and it’s here I must introduce the term “effect size”. The effect size is really important in studies like this. It is, as its name suggests, a measure of the size of an effect. You see, the thing about statistical significance tests is that they only tell you there is a difference; they say nothing about how big that difference is. For example, imagine you were doing a test to see if a new medicine prevented heart attacks. Your control group would be given the drug currently used and your experimental group would use the new drug. If you found a statistically significant difference between the two drugs, your immediate reaction would probably be that the new drug is better. You’re not wrong, but you need to know how much better before you decide to switch everyone from the old drug to the new drug. If there’s only a 1% difference between the effectiveness of the two drugs then it might not be worth swapping, especially if the new drug was more expensive or had worse side effects. If there was a 20% difference then the cost or side effects would probably be worth the switch, and if there was a 50% difference then only the highest costs or worst side effects would stop you from switching. The effect size matters.
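This is easy to demonstrate. Below is a small Python sketch, again with invented numbers rather than the trial data, in which a huge sample makes a one-percentage-point difference between two drugs comfortably “statistically significant” even though the effect itself is tiny:

    from math import sqrt, erfc

    # Invented numbers for illustration: two very large trial arms in which
    # the new drug does better than the old one by just one percentage point.
    n_old, n_new = 50_000, 50_000
    rate_old, rate_new = 0.30, 0.31

    # Two-proportion z-test for whether the difference is statistically
    # significant.
    pooled = (rate_old * n_old + rate_new * n_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (rate_new - rate_old) / se
    p_value = erfc(abs(z) / sqrt(2))   # two-sided p-value from the normal tail

    print(f"z = {z:.2f}, p = {p_value:.4f}")            # p well below 0.01: **
    print(f"effect size = {rate_new - rate_old:.0%}")   # but only a 1% difference

The statistical significance here is driven almost entirely by the sample size; how big the benefit actually is remains a separate question.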

So what are the effect sizes in this study? Well, looking at the mean days in employment, the CAP group had 9 more days and the OCM group had 11 more days. 91 weeks is 637 days, so that equates to only about 1.4% and 1.7% more days in employment respectively. Even the biggest effect, that of the OCM intervention on the number of days on benefit, produced a 4% difference. Esther McVey admits that these schemes cost money, yet all this additional funding is, at best, going to result in a 4% difference. Most damningly, the number of people in employment at the end of the trial was not significantly different from the control group for either of the interventions. The effect sizes are trivial.
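For anyone who wants to check the arithmetic, here it is; the 9 and 11 extra days come from the table above, the rest is just division:

    # Extra days in employment over the 91-week tracking period,
    # expressed as a share of the whole period.
    days_tracked = 91 * 7      # 637 days
    extra_days = {"CAP": 9, "OCM": 11}

    for group, days in extra_days.items():
        print(f"{group}: {days / days_tracked:.1%} more days in employment")
    # CAP comes out at about 1.4% and OCM at about 1.7%: a couple of per cent at most.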

The study has a rather disingenuous conclusion. It states:
“The results from the trailblazer are very positive, showing people assigned to Community Action Programme (CAP) and Ongoing Case Management (OCM) spent significantly less time on benefits and more time in employment in comparison to the control group, and this impact was sustained over a long time."
[Source, p. 22]

I agree that the results are positive; however, I disagree with their interpretation. When they use the word “significantly” they are using it in the statistical sense, meaning there is only a 5% or 1% chance that a true null hypothesis would have produced these results. Yet when most people read that statement they’ll think it means "considerable": they’ll think it means that the effect size is large. The results categorically do not show this; they show the opposite. Their final sentence, about the impact being sustained over a long time, is also disingenuous. This is a typical tracking graph:

[Source, p. 14]

It is true that the group spent “less time on benefits” than the control and “this impact was sustained”, but look at the trend. Look at the direction. It’s heading back to the control line. It looks like any positive effect was only temporary. What’s causing that? Well, I can think of one reason, which may well be incorrect but which I think is worth considering. It’s called the Hawthorne Effect, or the Observer Effect. When humans are being studied, they change their behaviour. These changes are often attributed to the intervention being studied but are actually due to the very act of being observed. It means that in studies where the subjects know they are in an experiment (and the people in the CAP and OCM groups will know they’re being given a new intervention) they will change their behaviour simply because they are in the study. The people in the control group won’t exhibit this effect because nothing has changed for them. It may also be that the staff running the experimental groups are more passionate at the start, really hoping that this may be the thing that helps these people find work at long last. In the first few weeks and months everyone, staff and unemployed alike, gives it all they’ve got, but as the weeks go on the novelty wears off, the realisation that the scheme isn’t doing much to help sets in, enthusiasm wanes and the experimental groups start to look more and more like the control group.

This is, of course, pure speculation, but it would explain the trends that are seen. Other suggestions would be most welcome.

Esther McVey did not lie when she said that these interventions helped. But it’s easy to misuse statistics without actively lying. Tell an outright lie and people will call you on it. Misinterpret, draw misleading conclusions and use words deceptively, and you are still telling the truth, just not the whole truth. The study does show significant differences, but only in the statistical sense. In the sense that unemployed people actually care about – a much better chance of getting a job – the study shows no such thing.

---------------------------------------------------------------

The new scheme makes some incredible demands of unemployed people. Excessive and, I’d argue, punitive demands. I was hoping to get into the real-world effects in this post, but explaining the study took far longer than I was expecting, so I will cover them next time.
