7 June 2023

Address to the Sir Roland Wilson Foundation 2023 Scholar Symposium, Australian National University, Canberra

Better evaluation builds a stronger public service: The Australian Centre for Evaluation

I acknowledge the Ngunnawal People, the traditional custodians of the land on which we gather today and recognise any other people or families with connection to the lands of the ACT and region.

I pay my respects to their Elders, extend that respect to all First Nations people present today, and commit myself to the implementation in full of the Uluru Statement from the Heart.

I also recognise the Sir Roland Wilson Foundation and its role in strengthening the links between academic research and public policy. It’s an objective close to my economist heart, and one that exemplifies the special role of the Australian National University. And it’s great to see the partnership extended to Charles Darwin University.

On that note, I would like to acknowledge the Foundation’s Chair, Dr Martin Parkinson, and the Foundation’s scholars. As you embark on your research, I encourage you to draw inspiration from your peers, those who have come before you and Roland Wilson’s legacy.

Roland Wilson was revered as an ‘intellectual force’ – one of the great public servants who helped shape economic policy over many decades (Sir Roland Wilson Foundation 2023). Every story I’ve heard about Roland Wilson suggests he was a problem solver, not just in an economic policy sense.

During World War Two petrol rationing, he famously built himself an electric vehicle out of junkyard scraps including a motor from an old crane and charger bulbs smuggled out of the United States (Simpson 2012). I wrote about it in a Canberra Times opinion column titled ‘Electric cars make the weekend more fun’, mainly so that the newspaper would run the photo of Wilson in his three-wheeled vehicle outside Old Parliament House (Leigh 2022).

In the era before we put water in Lake Burley Griffin, Wilson personally built a swimming pool in his backyard to cool off during the hot and dry Canberra summers (Treasury 2001). He must have needed the pool after all that excavation work.

Drawing my own inspiration from Wilson, I want to talk about an electrifying tool that can help the Australian Public Service better use its resources. It’s a tool that allows the public service to dig a bit deeper to solve policy problems and deliver effective programs.

That tool is the randomised trial.

Measuring what works

Randomised trials are a simple yet effective way to measure what works.

In medicine, randomised trials date back to James Lind’s work on scurvy, and Ambroise Paré’s work on treating battlefield burns.

The results from the latest randomised trials never cease to surprise me. That’s the point of a good evaluation: if you’re not being surprised, you’re not doing it right.

Another way to put this is a plea for humility.

We politicians draw on a lot of excellent advice and research from bureaucrats, academics, think tanks, peak bodies and others to design good policies.

After all that advice and research, after all the debate and consultations, it’s easy to be seduced into believing ‘this will work’.

Let me give you an example.

Ten job training programs: how many work?

A study published last year analysed ten different job training programs in the United States (Juras, Gardiner, Peck, and Buron 2022). Doubtless the people who designed every one of these programs were confident that it would work.

There is plenty of evidence about the types of training people might need, the way to deliver the training, who should deliver the training, and so on.

The ten sets of program designers would have had access to thousands of relevant research studies and the very latest data visualisation tools. I expect that they were thoughtful, altruistic and enthusiastic when they designed their programs.

There is no reason to think that the people who designed these programs were any less smart and caring than the people in this room. Indeed, they probably knew more about job training than most of us.

And yet, designing a successful job training program is difficult.

The ten US programs I mentioned represented a range of job training strategies.

Each program was evaluated in a sizeable randomised trial tracking earnings over six years.

How many had a positive impact on earnings? Maybe you know where I’m going – perhaps five out of ten? Perhaps two out of ten?

Go lower. Only one, the Year Up program, had a positive, significant impact on earnings.

The good news is that Year Up increased long-term earnings by over US$7,000 per year.

As the study points out, a lot needs to go right for a training program to boost earnings. It must have a sufficient impact on the credentials earned, those credentials must have labour market value, and the participants must find jobs.

Training programs can fail because participants don’t complete their studies, because the credentials have low economic returns, or because participants don’t move into employment.

We need rigorous evaluation not because program designers are foolish or careless, but because the problem is really, really difficult.

Even the best-sounding program can turn out to be ineffective.

This pattern is typical – it shows we need to be humble about how effective our pet policies are likely to be.

If you’re a person who works on job training programs, it may reassure you to know that for every ten pharmaceutical treatments that enter clinical trials, only one makes it onto the market (Leigh 2018, p26). Nine out of ten medical drugs that looked promising in the lab fail to make it through the three stages of clinical trials.

Just as happens in health, social policy experts need to be honest about removing or redesigning ineffective programs so funding can be directed to others, like Year Up, that pass the test.

The fundamentals of good impact evaluation

A feature of these and other randomised trials is that they answer the question ‘did the program work?’ by trying to determine what would have happened if the program had not been implemented.

This isn’t always obvious.

Perhaps the patients would’ve got better anyway.

Perhaps the students whose parents enrolled them in tutoring would’ve done well at school anyway.

Or perhaps a job training program appeared to be unsuccessful due to weak economic conditions but actually helped more people find jobs than if they hadn’t received the training.

How can we know what would have happened in any of these scenarios?

The counterfactual

In other words, how can we know the ‘counterfactual’?

The appeal of randomised trials is that they’re a simple way to construct a credible counterfactual.

Suppose we decided to test the impact of caffeine on concentration by doing an experiment with everyone in this room.

If we tossed a coin for each person, we would end up with roughly half in the heads group, and half in the tails group.

Then comes the experiment. When you get to the barista, she discreetly tosses a coin. Heads, she gives you a regular caffeinated coffee. Tails, she gives you a decaf coffee.

Ethics complaint letters can be addressed to Dr Parkinson.

The two groups – heads and tails – won’t be identical but they’ll be very similar.

This means that, if we found those in the caffeine group were able to concentrate for longer, it would be reasonable to conclude that, on average, caffeine is performance enhancing – at least for people similar to those in this room.

What would have happened without randomisation? If we’d just compared those who chose coffee with those who chose decaf, then the ‘selection effects’ would have mucked up the experiment. Perhaps coffee drinkers were the kinds of people who prioritised alertness. Or maybe coffee drinkers were extra tired after a big night. Without a credible counterfactual, the observational data would not have told us the true effect of caffeine on performance.
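For anyone who likes to see that logic written down, here is a minimal simulation sketch. The numbers are invented purely for illustration (a true caffeine boost of five concentration points, and an assumed tendency for tired people to choose coffee); it simply shows why the coin toss recovers the true effect while the self-selected comparison does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                     # audience size (invented)
true_effect = 5.0              # assumed true boost to concentration from caffeine

# Each person has a baseline concentration score; tired people score lower.
tiredness = rng.normal(0, 1, n)
baseline = 50 - 3 * tiredness + rng.normal(0, 5, n)

# Randomised trial: the barista's coin toss ignores tiredness entirely.
coin = rng.integers(0, 2, n)                    # 1 = caffeinated, 0 = decaf
outcome_rct = baseline + true_effect * coin
rct_estimate = outcome_rct[coin == 1].mean() - outcome_rct[coin == 0].mean()

# Observational comparison: tired people are more likely to choose coffee.
chooses_coffee = (rng.normal(0, 1, n) + tiredness) > 0
outcome_obs = baseline + true_effect * chooses_coffee
naive_estimate = outcome_obs[chooses_coffee].mean() - outcome_obs[~chooses_coffee].mean()

print(f"True effect:            {true_effect:.1f}")
print(f"Randomised estimate:    {rct_estimate:.1f}")    # close to 5
print(f"Observational estimate: {naive_estimate:.1f}")  # biased downward by selection
```

Under these made-up assumptions, the randomised comparison lands close to the true effect of five points, while the naive comparison understates it badly, because the coffee choosers started out more tired.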

The problem with observational studies isn’t just an academic curiosity. In medicine, researchers using observational data had long observed that moderate alcohol drinkers tended to be healthier than non-drinkers or heavy drinkers. This led many doctors to advise their patients that a drink a day might be good for their health.

Yet the latest meta-analyses, published in JAMA Network Open, now conclude that this was a selection effect (Zhao et al 2023). In some studies, the population of non-drinkers included former alcoholics who had gone sober. Compared with non-drinkers, light drinkers are healthier on many dimensions, including weight, exercise and diet. Studies that use random differences in genetic predisposition to alcohol find no evidence that light drinking is good for your health (Biddinger et al 2022). A daily alcoholic beverage isn’t the worst thing you can do, but it’s not extending your life.

The problem extends to just about every study you’ve ever read that compares outcomes for people who choose to consume one kind of food or beverage with those who make different consumption choices. Health writers Peter Attia and Bill Gifford point out that ‘our food choices and eating habits are unfathomably complex’, so observational studies are almost always ‘hopelessly confounded’ (Attia and Gifford 2023, p300).

A better approach is that adopted by the US National Institutes of Health, which is conducting randomised nutrition studies. These require volunteers to live in a dormitory-style setting, where their diets are randomly changed from week to week. Nutritional randomised trials are costlier than nutritional epidemiology, but they have one big advantage: we can believe the results. They inform us about causal impacts, not mere correlations.

Making it credible

We know a credible counterfactual matters because researchers have used well‑conducted randomised trials as a yardstick and then compared their results with those from non-randomised evaluations.

Regrettably, they often don’t match.

For example, non‑randomised studies suggested free distribution of home computers had large positive effects on school students’ test scores – but randomised evaluations showed they in fact had little benefit (Bulman and Fairlie 2016).

It isn’t always ethical or feasible to randomly allocate government policies in this way. You can’t randomly assign some people to receive a defence or foreign policy, and others not.

But you might be surprised how often it is possible.

There are numerous international examples of policies or programs that have been subject to randomised trials:

  • Housing subsidies for homeless people (Gillespie et al 2021, Hanson and Gillespie 2021, Cunningham et al 2021)
  • Crime prevention, policing methods, and restorative justice programs (Leigh 2018, Ch6)
  • Weight loss programs (Ahern et al 2022)
  • Microfinance programs (Schaberg et al 2022, Banerjee and Duflo 2011, Ch7)
  • Parenting programs, preschool and after-school care (Leigh 2018, Ch5)

Random allocation isn’t the only way to construct a credible counterfactual.

We can also use a range of ‘quasi-experimental methods’, and in my past life as an academic I used some of them myself.

However, these methods have more tricks and traps that make them hard to do well and, even when done well, their results can be harder to communicate to policymakers and the public.

So, whenever we want to know what works, we should look first to randomised trials.

What works… and beyond

Any policymaker will want to know the answer to ‘Did it work?’

But while that question may be central, it’s far from the only evaluation question of interest.

Evaluation efforts can be directed towards finding out whether a program reached the target population or whether it was delivered as intended, on time and on budget.

This work is important for the process of continuous improvement.

It often occurs as new programs are being refined and improved.

And it’s also crucial for how we interpret results of randomised trials.

A textbook example

For example, four studies found that students in developing countries who were randomly assigned to receive textbooks did no better on standardised tests than students without textbooks (Glewwe and Muralidharan 2016, pp39-40 and 78-79).

A naïve response might be to simply conclude ‘textbooks do not matter’.

But it’s important to unpack the findings.

In the first study, schools put the textbooks in storage rather than delivering them to the classrooms.

In the second study, free textbooks led parents to reduce the amount they spent on their children’s education.

In the third study, teachers did not incorporate textbooks into their teaching.

The final study found textbooks did help the top students, but not the remainder, who were unable to read.

Understanding the context of the studies helps researchers and policymakers form a fuller picture of the intervention.

The need for better evaluation of government programs

Calls for better evaluation of government policies and programs have been growing louder.

The 2019 Independent Review of the APS chaired by David Thodey said ‘the APS needs to reverse the long-term decline in research and evaluation expertise and build integrated policy capability’ (Thodey et al 2019 p220).

Research commissioned for the review found the Australian Public Service’s ‘approach to evaluation is piecemeal in both scope and quality, and that this diminishes accountability and is a significant barrier to evidence-based policy-making’ (Bray, Gray and ‘t Hart 2019 p8).

Findings were similar in the 2022 Independent Review into Australia’s response to COVID-19 led by Peter Shergold.

It said, ‘existing evaluation efforts are typically piecemeal and low-quality and rarely translate into better policymaking’ (Shergold, Broadbent, Marshall, Varghese 2022, p71).

The 2023 Disrupting Disadvantage Report published by CEDA said ‘without consistent program evaluation and implementing improvements based on data, evidence and analysis, ineffective programs are allowed to continue even as effective programs are stopped’ (CEDA 2023 p8).

CEDA said they ‘examined a sample of 20 Federal Government programs with a total expenditure of more than $200 billion. Ninety-five per cent of these programs were found not to have been properly evaluated’ (CEDA 2023 p8).

The Productivity Commission has made recommendations about the need for evaluation in several reports.

Its 2020 Indigenous Evaluation Strategy paper said, ‘the reality is that evidence about what works and why remains thin’ (PC 2020, p2).

The Interim Economic Inclusion Advisory Committee’s recent report recommended evaluations ‘including randomised control trials and effective use of administrative data’. It suggested that ‘funding should be re-allocated from things that do not work to things that do, so that approaches that are found to deliver the best outcomes can be scaled up’ (EIAC 2023 p9).

All strong arguments from a wide range of respected bodies.

The common theme from all these reports is the Government should commission more and better-quality evaluations, be more transparent about evaluation findings, and make better use of evaluation evidence.

A new evaluation function for the Australian Public Service

The Albanese Government agrees.

We committed to establishing an evaluation unit during the last election and we’ve delivered on our promise in the recent Budget.

We announced $10 million in funding over four years to establish a central evaluation function within Treasury.

It will be known as the Australian Centre for Evaluation.

As you can imagine, given my earlier comments, a core role for the Centre will be to champion randomised trials and other high‑quality impact evaluations.

It will partner with government agencies to initiate a small number of high‑quality impact evaluations each year.

These evaluations will help to build momentum by demonstrating the value of high-quality evaluation methods, sharing lessons across the Australian Government, and building agencies’ evaluation capabilities.

Most evaluation activities will, however, continue to be conducted or commissioned by agencies.

Consequently, the Australian Centre for Evaluation will also support and partner with agencies to build their capabilities and prepare their own evaluations, lifting both the volume and quality of evaluations APS-wide.

Over the long term, the Centre will do much more than this.

As part of the Albanese Government’s APS Reform Plan, the Centre’s remit is to embed an enduring culture of evaluation – in all its forms – across the Australian Government.

This will involve further work to support the adoption of the Commonwealth Evaluation Policy and Toolkit, building on the efforts to date led by the Department of Finance.

The Centre will also oversee efforts to build evaluation capability across the Australian Public Service.

This includes leadership of the new Commonwealth Evaluation Community of Practice, which has already grown to nearly 400 members since it was launched in September last year.

Finally, the Centre will play a critical role in the budget process, reviewing the evaluation plans and use of evaluation evidence for selected budget proposals.

The Centre’s approach will be underpinned by strong collaborative partnerships. Unlike an auditor-style role, the Australian Centre for Evaluation will be a trusted adviser and partner.

It will work with agencies across the Australian Government to rebuild in-house evaluation expertise and reduce the reliance on external providers.

This collaborative approach will also involve working closely with states, territories, not-for-profits, academia, and international partners.

Closing remarks

As we bunker down for winter, let me leave you with one final example of a randomised trial and the powerful lessons governments and policymakers can learn.

It involves the Victorian Government’s Healthy Homes Program (Victorian Government 2022).

It was a free program rolled out over three years from 2018 to 2020 to upgrade the energy efficiency of 1,000 homes of low-income Victorians with a health or social care need.

The aim of the Healthy Homes Program was to improve the health outcomes of vulnerable people by spending around $2,800 on making their homes warmer.

Upgrade options included draught sealing, installing insulation, replacing or servicing existing heaters, and improving window furnishings.

The trial found the Healthy Homes Program increased average indoor temperatures by 0.33 degrees and reduced exposure to temperatures below 18 degrees by 43 minutes per day.

It may sound minor, but by making homes warmer and more energy efficient, the Healthy Homes Program led to savings of $887 per person in the healthcare system and $85 per person in energy costs over the winter months.

In other words, the upgrade paid for itself within three years, and the full program costs (including administration costs) would be paid back in less than seven years.
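For those who want to check the arithmetic, a rough back-of-the-envelope sketch using the figures quoted above, and assuming the winter savings recur each year, reproduces the ‘within three years’ payback:

```python
# Back-of-the-envelope payback check for the Healthy Homes figures quoted above,
# assuming the winter savings ($887 health system + $85 energy) recur each year.
upgrade_cost = 2_800          # average upgrade spend per home ($)
annual_saving = 887 + 85      # health-system plus energy savings per winter ($)

payback_years = upgrade_cost / annual_saving
print(f"Upgrade payback: {payback_years:.1f} years")   # roughly 2.9 years, i.e. within three
```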

We want to see more trials like this.

As Treasurer Jim Chalmers and Finance Minister Katy Gallagher have noted, it’s vital we evaluate more programs – creating a better evidence base about what works, and what doesn’t.

The Albanese Government is committed to improving the quality of evaluations across government.

Thank you and enjoy your coffee break.

References

Ahern A et al (2022) ‘Effectiveness and cost-effectiveness of referral to a commercial open group behavioural weight management programme in adults with overweight and obesity: 5-year follow-up of the WRAP randomised controlled trial’, Lancet Public Health 7(10): e866-e875.

Attia P and Gifford B (2023) Outlive: The Science and Art of Longevity, Harmony, New York.

Banerjee, A and Duflo E (2011) Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs Press, New York.

Biddinger K, Emdin C, Haas M, Wang M, Hindy G, Ellinor P, Kathiresan S, Khera A, Aragam K (2022) ‘Association of Habitual Alcohol Intake With Risk of Cardiovascular Disease’, JAMA Network Open, 5(3): e223849.

Bray R, Gray M, ’t Hart P (2019) ‘Evaluation and learning from failure and success’, An ANZSOG research paper for the Australian Public Service Review Panel.

Bulman G, Fairlie R (2016) ‘Technology and education: The effects of computers, the Internet and computer assisted instruction on educational outcomes’ in Eric A. Hanushek, Stephen Machin and Ludger Woessmann (eds), Handbook of the Economics of Education, Volume 5, Amsterdam: Elsevier, pp. 239–80.

Committee for Economic Development of Australia (CEDA) (2023) ‘Disrupting Disadvantage’, Committee for Economic Development of Australia, Melbourne.

Crowe K (2018) ‘University of Twitter? Scientists give impromptu lecture critiquing nutrition research’, CBC News, 5 May.

Cunningham M, Hanson D, Gillespie S, Pergamit M, Alyse A, Spauster P, O’Brien T, Sweitzer L (2021) ‘Breaking the Homelessness-Jail Cycle with Housing First’ Urban Institute, Washington DC.

Gillespie S, Hanson D, Leopold J, Oneto A (2021) ‘Costs and Offsets of Providing Supportive Housing to Break the Homelessness-Jail Cycle’ Urban Institute, Washington DC.

Glewwe P, Muralidharan K (2016) ‘Improving Education Outcomes in Developing Countries: Evidence, Knowledge Gaps, and Policy Implications’ in Eric A. Hanushek, Stephen Machin and Ludger Woessmann (eds), Handbook of the Economics of Education, Volume 5, Amsterdam: Elsevier, pp. 653–744.

Hanson D, S Gillespie (2021) ‘Improving Health Care through Housing First’ Urban Institute, Washington DC.

Interim Economic Inclusion Advisory Committee (EIAC) (2023) ‘2023-24 Report to the Australian Government’, Australian Treasury, Canberra.

Juras R, Gardiner K, Peck L, Buron L (2022) ‘Summary and Insights from the Long-Term Follow-Up of Ten PACE and HPOG 1.0 Job Training Evaluations: Six-Year Cross-Site Report’, US Department of Health and Human Services, Washington DC.

Leigh A (2022) ‘Electric cars make the weekend more fun. Canberrans have known this for decades’, Canberra Times, 22 September.

Leigh A (2018) Randomistas: How Radical Researchers Changed Our World, Black Inc, Melbourne.

Productivity Commission (PC) (2020) ‘Indigenous Evaluation Strategy’, Productivity Commission, Melbourne.

Schaberg K, Holman D, Quiroz Becerra V, Hendra R (2022) ‘Pathways to Financial Resilience: 36-Month Impacts of the Grameen America Program’, MDRC, New York.

Shergold P, Broadbent J, Marshall I, Varghese P (2022) ‘Fault Lines: Independent Review into Australia’s Response to COVID-19’, e61 Institute, Sydney.

Simpson M (2012) ‘1942 home-made electric car’ NSW Government, Museum of Applied Arts & Sciences Object No. B2339 [object summary, production notes and history], Sydney.

Sir Roland Wilson Foundation (2023) ‘Sir Roland Wilson Foundation 2023 Alumni Impact Report’.

Thodey D, Davis G, Hutchinson B, Carnegie M, de Brouwer G, Watkins A (2019) ‘Our Public Service, Our Future, Independent Review of the Australian Public Service’ Department of the Prime Minister and Cabinet, Canberra.

Treasury (2001) ‘The Centenary of Treasury – 100 years of Public Service’, Australian Treasury, Canberra.

Victorian Government (2022) ‘The Victorian Healthy Homes Program – research findings’, Victorian Government, Melbourne.

Zhao J, Stockwell T, Naimi T, Churchill S, Clay J, Sherk A (2023) ‘Association Between Daily Alcohol Intake and Risk of All-Cause Mortality: A Systematic Review and Meta-analyses’, JAMA Network Open, 6(3): e236185.