Nicholas Gruen: evaluation knowledge comes not from numbers, but questions

As policymakers look back for answers to what worked — like PM&C’s $40m for evaluation of Indigenous programs — the pass/fail approach won’t, in itself, help find the answers that government will need as circumstances change.

                                     In memoriam: Bill Craven *

In 2005 Peter Shergold, then the country’s most senior public servant, said this:

“If there were a single cultural predilection in the APS that I would change, it would be the unspoken belief of many that contributing to the development of government policy is a higher order function – more prestigious, more influential, more exciting – than delivering results.”1

He spent another three years championing this idea from the top job. But a decade later, reporting to Prime Minister Tony Abbott on the public service, he concluded that progress on the point had been scant.2

All of which serves to underline the point that the hierarchies that dictate policy are not just hierarchies of people, but also of knowledge. You can see the power of hierarchies of knowledge when it comes to Royal Commissions. When some shocking revelations came to light about South Australia’s child protection system, the Premier set up a Royal Commission, as others had done in child protection before him.

When a system full of people paid hundreds of dollars a day fails, we send in the lawyers – a profession which might not know anything about child protection – but we pay them thousands of dollars a day. No-one ever got sacked for buying IBM or hiring Deloitte, and surely those QCs can work out a thing or two about child protection. And we keep getting back answers that don’t work.

Rise of the econocrats

When I was a kid, law was the uber discipline. There was no other. But today there’s another discipline which is more powerful still. Its senior practitioners aren’t paid like QCs but they dominate the upper echelons of the public service. I’m speaking of my own profession – economics.

Economics has always chased Adam Smith’s grand vision of following Isaac Newton in building a vast disciplinary edifice from simple axiomatic foundations. Smith himself spoke of the Newtonian Method of rhetoric and it’s pretty obvious that he cast his two great books accordingly. Especially as Smith’s idea of economics as a moral science has given way to the more modern (perhaps I should say ‘modernist’) idea that we can codify Smith’s idea in formally specified models, this gives economics a relentless reductionism. That’s a great strength in many contexts. It simplifies things down to certain commonsensical basics and so it sweeps away a lot of undergrowth. Where we can get by adequately without that undergrowth, so much the better.

Nevertheless, as powerfully as the radical abstractions of economics can help us get to the nub of a matter, they’re also a seductive invitation to ignore much that matters. In the world of policy, rather than take their discipline, as Keynes suggested, as a set of tools for structuring open-minded inquiry and exploration, many economists take it to endorse settled conclusions which then become a badge of tribal identity, and an invitation to hubris.

Even that isn’t all downside. Economists’ pride in the rigour and hard-headedness of their discipline has made them champions for evidence-based policy. As the economists at the Productivity Commission (PC) have pointed out, we are spending many billions of dollars on programs to promote Aboriginal welfare with remarkably little attention to whether they work or not. It’s the economists at the PC, supported by economists like Peter Shergold and his successor Martin Parkinson, who have managed to get an additional $40 million allocated for evaluating programs for Indigenous Australians, which is a great opportunity for policy learning, and God knows we could do with some in that area.

But too impatient, too hubristic a quest for rigour can lead us astray.

Here’s the thing. In the last few months, I’ve made a point of asking a number of such people at very senior levels, econocrats who regard themselves as rusted-on evidence-based policy people, if they know what ‘program logic’ is. They don’t.

Overconfidence of the ‘gold standard’

What many champions of evidence-based policy have in mind is commonsense. We should have rigorous evaluation of new programs and pick ones that ‘work’. Hence the $40 million. And the best way to know if programs work is with randomised controlled trials (RCTs), which are often referred to as the ‘gold standard’ of evidence. Still, as Sherlock Holmes put it in a somewhat different context, “there is nothing more deceptive than an obvious fact”.

We should certainly pay far more attention to independent validation of our knowledge of what works. Indeed it’s somewhat shocking that, for a country which is or at least was one of the best policy reformers in the world, we’ve always been a laggard when it comes to RCTs.

But I’m in good company when I tell you that RCTs are one among many tools, and not quite the panacea they’re being made out to be. The 2015 Nobel laureate Angus Deaton agrees. James Heckman, the 2000 laureate and one of the great econometricians of the last century, describes RCTs as “a metaphor and not a gold standard”.

The thing about RCTs is that they assure us of just one thing. To be precise, they give us a known degree of confidence that, at a particular time and place, a particular treatment had a particular effect.

The idea that RCTs are a gold standard seems appealing. But it also has its downsides. It collapses the difficult task of evidence-based policy into single, discrete routines, tips and tricks. For the knowledge from an RCT to be useful, these routines, tips and tricks must either work independent of context or be given some additional work to test their applicability elsewhere.

Note two things about RCTs. Firstly, an RCT gives the view from the top. It’s certainly a major problem of social research and social policy that those working in the field can talk a good game about how their intervention is fundamental to addressing social harm and injustice. And there’s plenty of confused and wishful thinking amongst those in the field about the efficacy of the programs they run.

In this context an independent RCT is a very useful means by which those in senior policy positions can keep those delivering programs under surveillance – and force them into a more evidence-based discourse for justifying their program.

So far so good.

Numbers don’t always give knowledge

But the second thing about an RCT is that, to be effective, it must tame and confine the knowledge we’re after – of what works in the field – into a simplified, discrete question. This is an example of one of the pathologies of economics – as distinctive to our profession as wigs and gowns are to the fanciest lawyers. Instead of carefully adapting our methods to the kind of knowledge that would be most useful, we presuppose that methods that resemble those used in science must give us the ‘gold standard’ knowledge. This is the intellectual vice that Friedrich Hayek anatomised and anathematised as “scientism”.

The high watermark of scientism is usually taken to be the words of Lord Kelvin in 1883 in which he argued that “when you cannot measure [something], when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind”. I guess it’s too bad for him that he chose to express this truth in words, not numbers. Indeed it might give us pause to realise that he couldn’t possibly express it in numbers.

Be that as it may, prestigious academic journals are happy to snap up good studies of such discrete questions, especially if a well designed and funded RCT is involved. But the risk is that the knowledge will be crude and decontextualised. There’s a deep academic literature on questions like “do performance pay for teachers, school vouchers or charter schools improve student outcomes?” But the answer to these kinds of questions is usually that “it depends”. As Deborah Johnston puts it in discussing aid to Africa, “It is an over-simplified and erroneous question to ask, do cash transfers work?”3

A more productive question relating to the same subjects might be this: “in what kinds of circumstances might performance pay or school vouchers improve performance and what structures will help optimise outcomes”. One might be able to go back into the data collected for RCTs, and it may shed some light on those questions, but it will be hard work because the whole architecture of the study is focused on a singular question, not on helping to steer our way through a specific situation.

A great deal of the policy and delivery know-how we desperately need can’t be simplified into discrete, context-independent nuggets of knowledge. How does one improve mental health or reduce domestic violence in outback communities, in the exasperated outer suburbs of our sprawling cities, or in our regional towns? How do we do the best we can for children whose parents cannot or will not look after them properly?

Formal RCTs will be a small part of the progress we make on these questions. If we look at the way successful innovation works in most circumstances, most of it doesn’t get down to the implementation of single ideas that work largely irrespective of context. It usually requires considerable investigation, experimentation and coordination between different parts of systems with trade-offs carefully and collaboratively explored.

Of course this must be done as rigorously and as transparently as possible, with the assumptions behind a program – the program logic – tested along the way. In this context it’s possible to do all manner of mini-experiments which may take the form of an RCT, though it may be no more than A/B testing two ways to present a choice to a user, or various ways of wording a letter to program participants. Great innovators like Google and Amazon perform literally tens of thousands of such experiments every year, and ‘nudge units’ around the world are slowly taking these experiments closer to business-as-usual in government.4

The supremacy of questions

Given how capacious our ignorance is, and will always be, being humble and prepared to adapt one’s methods to the problem at hand is a good starting point for a discipline.

To explain why, let me offer a confession. I’m a Collingwood supporter. Not my beleaguered football team. I’m talking about a philosopher I want to recommend to evaluators everywhere: R. G. Collingwood, whom I ran into when studying history at uni. History, you see, is like evaluation in that it has no overarching theories to impose on its material. Unlike economics, it sets great store by attending to the material before it on its merits.

In any event, if you read R. G. Collingwood’s terrific little autobiography, which sketches his intellectual development, you’ll come across a story which he uses to explain where his philosophy starts. It starts with questions.

“Every day I walked across Kensington Gardens and past the Albert Memorial [which] began by degrees to obsess me .… Everything about it was visibly mis-shapen, corrupt, crawling, verminous; for a time I could not bear to look at it, and passed with averted eyes; recovering from this weakness, I forced myself to look, and to face .… the question: a thing so obviously, so incontrovertibly, so indefensibly bad, why had [the architect Gilbert] Scott done it? .… What relation was there, I began to ask myself, between what he had done and what he had tried to do? If I found the monument merely loathsome, was that perhaps my fault? Was I looking in it for qualities it did not possess, and either ignoring or despising those it did?”

For Collingwood, this slowly produced a revolution in his thinking. He came to believe that knowledge wasn’t captured in assertive propositions like “demand falls as price rises” or “increasing penalties for breach lowers tax evasion and dole cheating”. As he put it, “knowledge comes only by answering questions”. And, in order to get anywhere, “these questions must be the right questions and asked in the right order”.

Not just what worked, but why

Just as natural science is the painstaking process of proposing hypotheses (or, to use Collingwood’s terminology, asking questions) which make specific phenomena examples of deeper patterns in nature, so program, developmental and other forms of evaluation unpick a program into its many moving parts, each having a role in the program logic, so that each element of the logic can be investigated, validated, invalidated and/or optimised.

Thus evaluation becomes not just the investigation of what has worked. For knowledge of what has worked cannot, of itself, help show the extent to which we can make it work better, or the extent to which it will still work as circumstances change. With apologies to Lord Kelvin, this is “knowledge of a meagre and unsatisfactory kind”.

Evaluation must also generate disciplined, transparent knowledge of why things work. And that kind of knowledge is a gateway to greater insight as to what kinds of changes – in the program or in the context in which it operates – might affect the program’s efficacy. This is Collingwood’s idea that knowledge comes from, and can only come from, asking the right questions in the right order.

I’ll let him elaborate:

“For example, if my car will not go, I may spend an hour searching for the cause of its failure. If, during this hour, I take out number one plug, lay it on the engine, turn the starting-handle, and wait for a spark, my observation ‘number one plug is all right’ is an answer not to the question, ‘Why won’t my car go?’ but to the question, ‘Is it because number one plug is not sparking that my car won’t go?’ Any one of the various experiments I make during the hour will be the finding of an answer to some such detailed and particularized question. The question, ‘Why won’t my car go?’ is only a kind of summary of all these taken together.”

Antidote for wishful thinking

So there you have it – program evaluation à la R. G. Collingwood, a few decades before the ideas were formalised into program evaluation. I commend him to you as an antidote to a lot of confused and wishful thinking at the top of our hierarchies of organisation and knowledge, in which the tail of a certain method and its apparent rigour wags the dog of what we need to know.

If you agree with me that these are some of the things we need to do, and particularly some of the things I’m hoping policy makers have in mind as they work out how to spend that $40 million to build the evidence base for Indigenous programs, then we need to:

  • Build the status of delivery alongside policy;
  • Build the status of program evaluation over the blithe context-independent presumptions of those arguing for the dominance of RCTs; and
  • Deliver evaluation which directly helps those in the field improve their efficacy; whilst at the same time
  • Generating transparency for those outside the program as to how it’s going; and
  • Find ways to generate knowledge that is transferable, that helps us learn how to deliver more successful programs.

Is this possible, or is it a pipe dream?

I intend to take that question up when I address my proposal for an Evaluator General tomorrow.

This article is based on a speech by Nicholas Gruen to the AES 2017 conference dinner on September 5 in Canberra.

* On penning this article I wrote to someone from the ANU History Department who had tweeted about my address, asking to be put in touch with Bill Craven. He taught me “Ren and Ref” in 1977 and was the first person to introduce me to Collingwood’s way of thinking. For a long time I’ve entertained a vague idea that I should practise economics in a way that was informed by that thinking, but I’ve only come recently to reflect on it and articulate it – which I also did a little here. I discovered to my dismay that Bill died some time ago, filling me with remorse at not having made contact sooner. But one becomes aware of things slowly, or I do anyway. As another philosopher said, “the Owl of Minerva spreads its wings only with the falling of the dusk”.

1. Shergold, Peter, 2005. … quoted in Mendham, ‘The State of Project Management’, CIO, 1 November

2. “Here is a truth rarely admitted in the APS. Policy skills are generally viewed as ‘creative’ or ‘strategic’ while implementation skills are often perceived as ‘corporate’ or ‘operational.’ This outdated assumption can result in a bias towards promoting the former at the expense of the latter. It is premised on a falsehood.” Shergold, Peter, 2015. Learning from failure: why large government policy initiatives have gone so badly wrong in the past and how the chances of success in the future can be improved, Australian Public Service Commission, Canberra.

3. Johnston, Deborah. 2015. ‘Paying the price of HIV in Africa: Cash transfers and the depoliticisation of HIV risk’. Review of African Political Economy 42(145): 394–413.

4. “Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day….” Jeff Bezos. “Last year at Google [2010] the search team ran about 6,000 experiments and implemented around 500 improvements based on those experiments. The ad side of the business did about the same. Any time you use Google, you are in many treatment and control groups. The learning from those experiments is fed back into production and the system continuously improves.” Hal Varian, Chief Economist at Google.

  • To answer the closing question, I do not believe it is a pipe dream. Methods that capture observations or experiences from those directly affected by a proposed policy and have those people signify what matters about their contribution can deliver insights grounded in the context of the end user or client community.

    Most evaluations embody hypotheses or intrinsic assumptions that constrain and condition the process. It is possible to come at it the other way round. Instead of defining the framework of the evaluation from the top, provide an open space with loose constraints, just enough to position the exercise in the field of interest, within which the stakeholders can express their take on the subject from the bottom up.

    I realise that might sound a bit airy, but there is a concrete set of methods to do this; they just aren’t widely known yet because they have only been out of the lab for a bit over ten years. The general approach is a form of sense making. I won’t mention specific products here, but this can be carried out in face-to-face interactions and at scale through straightforward web tools.

    That still leaves the question of whether policy makers really want to know what people think matters, as opposed to building a framework of ideas within which they and politicians feel comfortable.

  • Nicholas Gruen

    Thanks Stephen,

    The rather fey ending is a product of the editing.

    The final line in this piece is a bit of a tease for the audience. In the full text and in my speech, I invited them to come to my session the next day, where I made out my case for an Evaluator General, to get the answer (which is no, it’s not a pipe dream :)

    • I like the idea. It would provide a central focus for good practice apart from anything else – somewhere to argue about what constitutes good practice as well as try to see it implemented. However, there is a danger that we satisfy ourselves with simply (possibly not the best word) eliminating political bias from evaluation. That is necessary but not sufficient to link policy to a community’s needs and priorities.

      Even if they are free of bias, surveys and studies adopt overt and tacit hypotheses and assumptions that are evident in the subject matter and in the phrasing and presentation of questions. This confines the exercise to seeing the (human) system being examined through predefined windows, overlooking and excluding matters that the specialist analyst or evaluator has not conceived of exploring. In the complexity of a human system, the community to be served by a program, these blind spots cannot be eliminated by more and more detailed and diligent design work to refine the evaluation. Not only is the potential scale of the challenge, its cardinality, too large to be examined exhaustively by an external observer; the concerns and priorities of a community are also always in flux. How often do we hear from those seeking to work with people in need, or from community representatives, that a government’s proposals are neither good nor bad, just missing the point?

      There will always be a need for an assessment of how well a program delivered on its objectives, a top down view. I believe that major improvements require us to allow the formulation of those objectives to flow up from the communities affected, allowing them to expose what really matters and to adjust this declaration as the world around them changes and the program in question starts to stimulate change and disrupt existing patterns.

      As an example of efforts to achieve this, see the Northern Ireland work. The primary input is selected by the respondent within the broad frame set by the title and the introduction to each collector. The respondent says what was most important about what they chose to contribute and does not have to think about whether to take advantage of or avoid the opportunity to give a glowing or damning reference. A few of the signifiers in these examples could be better, as they should offer a balance between options that are all positive or all negative. The three-way signifier is a powerful means of tapping into unadulterated feelings, free from the respondents’ efforts to craft their response to give the provider what the respondent thinks the provider wants, or to vent a background sentiment that might have little to do with the questions being asked.

      If anyone knows of a better way to tap into this sort of insight, I’d be keen to hear about it. The approach demonstrated in the Northern Ireland work is the only one I have found that can explore emergent insights.

      • Nicholas Gruen

        I’m not really au fait with it, but the SenseMaker framework, which seems to be what the link above illustrates, has always seemed very interesting and, to the extent that I understand it, worthwhile.