29 September 2023

Address to the Australian Evaluation Society 2023 International Evaluation Conference, Brisbane


The best medicine: lessons from health for policy randomistas

I acknowledge the traditional owners, the Turrbal and Jagera peoples, recognise all First Nations people present, and commit myself to the implementation of the Uluru Statement from the Heart. This starts with voting Yes on October 14 for a First Nations Voice to Parliament.

I note too that the Australian Evaluation Society supports the Voice to Parliament. In the powerful words of AES President Kiri Parata: ‘It's been too long that Aboriginal and Torres Strait Islander Peoples have been calling for their voices to be heard. First Nations Australians deserve better and their voices matter. I’m proud that the AES has elevated the voices of First Nations Peoples and will continue its commitment to quality evaluation that impacts positively for First Nations communities. Constitutional recognition will benefit all Australians as the flow on effects of this much anticipated change are realised.’

This is not a new focus for the Australian Evaluation Society. In preparing my talk, I went back to the earliest annual report on your website, from twenty years ago. It showed that in the 2002‑03 year, the AES set three priority areas, one of which was ‘Indigenous Evaluation’. Anona Armstrong, who founded the AES in 1982, has spoken about the importance of Indigenous evaluation to the Society.

I’m pleased to be speaking with you today about the Australian Centre for Evaluation, established in the 2023 Budget. My goal is to give you a sense of the philosophy underpinning the centre, how we intend it to operate and collaborate, and our hope for how it will contribute to the evaluation landscape in Australia.

Healthy Lessons

The best way to sum up our approach to policy evaluation is that we’re trying to take lessons from health, and in particular how rigorous evaluation has changed medical practices, saving money and lives. Although hospital dramas like Grey’s Anatomy and House are centred around star doctors, the real story of medicine today is one of good evidence driving out bad theories.

No area of medicine epitomises this shift better than the radical mastectomy.

In the 1880s, US surgeon William Halsted formed the view that breast cancer was most effectively treated by excising large portions of the patient’s tissue. Previous operations, he argued, had been too timid. Observing that patients often relapsed after surgery that removed only the tumour, Halsted advocated removing considerable amounts of surrounding tissue.

Halsted’s surgery removed the pectoralis major, the muscle that moves the shoulder and hand. He called it the ‘radical mastectomy’, drawing on the Latin origin of ‘radical’, meaning ‘root’. In The Emperor of All Maladies, Siddhartha Mukherjee describes how Halsted and his students took the procedure further and further. They began to cut into the chest. Through the collarbone. Into the neck. Some removed ribs. They sought out the lymph nodes, and claimed they had ‘cleaned out’ the cancer.

Women who endured these operations were left permanently disfigured, often with gaping holes in their chests. In some cases, they were unable to properly move an arm. In other instances, their shoulders permanently hunched forwards. Recovery could take years. But Halsted was unrepentant, referring to less aggressive surgery as ‘mistaken kindness’.

Halsted persuaded others not through his data, which was shaky, but through the force of his rhetoric and personality. He was supremely self‑confident, perhaps fuelled by his cocaine addiction, and belittled his critics for their faint‑heartedness. Radical mastectomies, he acknowledged, would disfigure patients. But these war wounds were the price of winning the battle.

Yet whether a patient survived breast cancer depended not on how much tissue was removed, but whether the cancer had metastasized and spread through her body. If it had not metastasized, a more precise operation to remove the cancer would have been just as effective. If it had metastasized, then a radical mastectomy would still fail to remove it.

In 1967, Bernard Fisher became chair of the National Surgical Adjuvant Breast Project at the University of Pittsburgh School of Medicine. Fisher was struck by the lack of evidence supporting the radical mastectomy. He became interested in the mystery of metastasis and the growing use of clinical trials in medicine. No matter how venerable the clinician, he argued, experience was no substitute for evidence. So Fisher began recruiting patients to take part in a clinical trial that would test the impact of the radical mastectomy by comparing the surgery against a more moderate alternative, the lumpectomy, which involved removing only the cancerous tissue.

Fisher’s randomised trial faced major hurdles. The first was to persuade women to participate in a trial in which randomisation would determine whether the surgeon would remove a lump or their entire breast. The second was to persuade surgeons to refer patients to the trial. Having been trained in radical surgery, many surgeons felt that a lumpectomy was unethical and were hostile to the trial. They flatly refused to refer patients to a trial that might see them receiving anything other than a radical mastectomy. But Fisher was helped by the burgeoning feminist movement. As activist Cynthia Pearson noted, the ‘women's health movement began talking about mastectomy as one of the examples of sexism in medical care in the United States’.

In the face of opposition in the United States, Fisher’s breast cancer surgery trial had to be expanded to Canada to get sufficient sample size. Eventually, it covered 1,765 patients, who were randomised into three groups – radical mastectomy, simple mastectomy and surgery followed by radiation. The results were finally published in 1981. They showed that there were no differences in mortality between the three groups. The women who had undergone radical mastectomies had suffered considerably from the surgery – yet they had not benefited in terms of survival. Fisher’s randomised trial changed how surgeons treat breast cancer, but it took a century. Between the 1880s and the 1980s, around half a million women underwent radical mastectomies, an unnecessary surgical treatment.

Your Choices Say a Lot About You

Randomised trials are valuable in instances where experts have strong views. In the case of breast cancer treatment, it took data to cut through ideology. An advantage of randomised trials is that they identify a clear counterfactual – what would have happened without the intervention. This can be especially important in instances where people self‑select into different treatments.

To see this, suppose that we wished to conduct an experiment on the impact of caffeine on whether people stay awake during a talk by a politician on a Friday afternoon at the end of a stimulating three‑day conference.

A randomised trial of this kind might involve a barista producing both regular coffees and decaf coffees. We might pick the coffee blends so that decaf and regular taste as similar as possible.

Each time a person walks up to the coffee stand, the barista tosses a coin. Heads, you get a regular coffee. Tails, you get a decaf coffee. The law of large numbers tells us that if we did this experiment with everyone in the room, we would end up with roughly half in the heads group, and half in the tails group.

Before taking a sip of the coffee, the heads and tails groups would be similar in every way. With a large number of people, we can reasonably expect that the groups will include a similar number of men and women, a similar number of junior and senior researchers, a similar number of morning larks and night owls.

As a result, if we observe differences in alertness between the two groups, then we know that they must be due to the caffeine. We could conclude from this that caffeine keeps people awake, at least for the kinds of people in this room.
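For those who like to see the logic in code, here is a minimal simulation sketch of the coin‑toss coffee trial. Everything in it is an illustrative assumption rather than real data: the function name run_coffee_trial, the alertness scale, the half‑point head start for morning larks, and the assumed caffeine effect of 1.0 are all made up for the example.

```python
import random
import statistics


def run_coffee_trial(n_people=10_000, true_effect=1.0, seed=0):
    """Simulate a coin-toss coffee trial. All numbers are hypothetical."""
    rng = random.Random(seed)
    regular, decaf = [], []
    regular_larks = decaf_larks = 0

    for _ in range(n_people):
        # A background trait that also affects alertness (assumed).
        is_lark = rng.random() < 0.5
        baseline = rng.gauss(5.0, 1.0) + (0.5 if is_lark else 0.0)

        # The barista's coin toss: heads means regular, tails means decaf.
        if rng.random() < 0.5:
            regular.append(baseline + true_effect)
            regular_larks += is_lark
        else:
            decaf.append(baseline)
            decaf_larks += is_lark

    return {
        "n_regular": len(regular),
        "n_decaf": len(decaf),
        "lark_share_regular": regular_larks / len(regular),
        "lark_share_decaf": decaf_larks / len(decaf),
        # Difference in mean alertness between the two groups.
        "estimated_effect": statistics.mean(regular) - statistics.mean(decaf),
    }


result = run_coffee_trial()
print(result)
```

With a large group, the two arms should come out at roughly equal size with a similar share of morning larks, and the difference in mean alertness should land close to the assumed effect. With only a handful of people, the coin can easily produce lopsided groups, which is why sample size matters.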

What would have happened without randomisation? What if we allowed everyone to ask the barista for regular or decaf, and then tracked the alertness of both groups? How might an observational study turn out differently?

In this case, an observational study would be plagued by selection effects. Those who chose the caffeinated drink might have been the kinds of people who prioritised alertness. Or maybe caffeine consumers were extra tired after a big night. Without a credible counterfactual, the observational data would not have told us the true effect of caffeine on performance. We would have learned a lot about the kinds of people who chose caffeinated drinks – but very little about the true impact of caffeine.
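To make the selection problem concrete, here is a small simulation sketch in which people choose their own coffee. The numbers are assumptions for illustration only: half the audience is tired after a big night, tired people start two points lower on a made‑up alertness scale, 90 per cent of tired people pick regular coffee against 20 per cent of everyone else, and caffeine truly adds one point. The function name run_self_selected_coffee is hypothetical.

```python
import random
import statistics


def run_self_selected_coffee(n_people=10_000, true_effect=1.0, seed=1):
    """Simulate people choosing their own coffee. All numbers are hypothetical."""
    rng = random.Random(seed)
    regular, decaf = [], []

    for _ in range(n_people):
        # Tiredness lowers alertness AND drives the choice of coffee (assumed).
        tired = rng.random() < 0.5
        baseline = rng.gauss(5.0, 1.0) - (2.0 if tired else 0.0)
        p_regular = 0.9 if tired else 0.2

        if rng.random() < p_regular:
            regular.append(baseline + true_effect)
        else:
            decaf.append(baseline)

    # Naive observational comparison: regular drinkers versus decaf drinkers.
    return statistics.mean(regular) - statistics.mean(decaf)


naive_gap = run_self_selected_coffee()
print(f"True effect: +1.0, naive observational gap: {naive_gap:+.2f}")
```

Under these assumptions the naive gap comes out negative: the caffeine drinkers look less alert than the decaf drinkers, even though caffeine helps every individual in the simulation. The comparison is picking up who chose the coffee, not what the coffee did.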

The problem with observational studies isn’t just an academic curio. In medicine, researchers using observational data had long observed that moderate alcohol drinkers tended to be healthier than non‑drinkers or heavy drinkers. This led many doctors to advise their patients that a drink a day might be good for your health.

Yet the latest meta‑analyses, published in the Journal of the American Medical Association, now conclude that this was a selection effect (Zhao et al 2023). In some studies, the population of non‑drinkers included former alcoholics who have gone sober. Compared with non‑drinkers, light drinkers are healthier on many dimensions, including weight, exercise and diet. Studies that use random differences in genetic predisposition to alcohol find no evidence that light drinking is good for your health (Biddinger et al 2022). A daily alcoholic beverage isn’t the worst thing you can do, but it’s not extending your life.

The problem extends to just about every study you’ve ever read that compares outcomes for people who choose to consume one kind of food or beverage with those who make different consumption choices. Health writers Peter Attia and Bill Gifford point out that ‘our food choices and eating habits are unfathomably complex’, so observational studies are almost always ‘hopelessly confounded’ (Attia and Gifford 2023, p300).

A better approach is that adopted by the US National Institutes of Health, which is conducting randomised nutrition studies. These require volunteers to live in a dormitory‑style setting, where their diets are randomly changed from week to week. Nutritional randomised trials are costlier than nutritional epidemiology, but they have one big advantage: we can believe the results. They inform us about causal impacts, not mere correlations.

Indeed, a clever experiment with mice has shown how problematic nutritional epidemiology can be. The study, led by Keisuke Ejima (Ejima et al 2016), starts off with a randomised experiment on calorie restriction and longevity. By randomly varying the amount of calories given to different groups of mice, the researchers show that mice that are fed a calorie‑restricted diet tend to live longer. This result replicated a well‑established finding, that calorie restriction boosts longevity.

Next, the researchers looked within those mice that had been allowed to eat as much as they wanted. Within that group, what was the association between calorie consumption and longevity? Now, the result flipped. Those mice that ate more calories lived longer – perhaps because they were doing more exercise or had faster metabolisms.[i]

Bottom line: the observational study produced exactly the wrong result. As Peter Attia and Bill Gifford’s work has observed, the complexity of what we choose to eat can confound any studies about the true effect of food on health.

The Australian Centre for Evaluation

In establishing the Australian Centre for Evaluation, we won’t only conduct randomised trials. But randomised trials will be an important component of the work of the centre, which is why I’ve focused my remarks on them until this point.

A few basics about the centre. The Australian Centre for Evaluation will be located in the Australian Treasury, and will partner with other government agencies to conduct rigorous evaluations. The Centre receives funding of around $2 million per year, and employs around 14 staff. Its work will be conducted within a careful ethical framework, ensuring that we are as rigorous about issues of ethics as about issues of causality.

The Australian Centre for Evaluation will not conduct all the evaluations in government. Given its size, that would be impossible. At present, the volume of external evaluations is over $50 million a year, and many agencies have their own in‑house evaluation teams. The Australian Centre for Evaluation will partner on a modest number of flagship evaluations, and work to build capacity across the public service on rigorous evaluation.

An important part of the process will be to ensure that evaluations are not unnecessarily expensive. Over recent years, the response rate in government surveys has fallen, while the quality of administrative data has risen. In this environment, we are looking for opportunities to see how evaluations can make better use of data that is already held by the government, while maintaining strict privacy protections. Randomised trials need not take decades and cost tens of millions of dollars. Low‑cost randomised trials can produce rapid insights at a modest cost.

The Australian Centre for Evaluation will be characterised by its openness. The choice of name is deliberate: we want this to be a centre for high‑quality evaluation nationally. This means that the Centre will be open to engaging as appropriate with states and territories, with non‑profits and philanthropic foundations, and with evaluation experts in the private sector and academia. We are all on the evaluation journey together, and engagement will help to build the nation’s evaluation capacity.

As part of that engagement, I am pleased to announce that the Australian Centre for Evaluation’s website is now live, at evaluation.treasury.gov.au. Feel free to now pull out your tablet or smartphone and ignore the remainder of my talk in favour of adding that site to your bookmarks, emailing it to your friends, and posting it on your social media channels. If ‘ACE website’ is trending today on the platform formerly known as Twitter, we will know that our work as evaluation nerds is done.


I began this talk by focusing on how randomised trials have helped transform medicine. While randomised trials of pharmaceuticals have been commonplace since the 1950s, randomised surgical trials are still relatively rare. Likewise, randomised nutritional trials are still in their infancy. In these fields and others, high‑quality evidence is helping to displace misplaced dogma. As the saying goes, ‘in God we trust, all others must bring data’.

Over coming years, I am curious about the extent to which developments in medicine may have applications in policy. Following Phase One safety trials, health researchers typically conduct two phases of clinical trials: Phase Two trials, on a relatively small population, and Phase Three trials, on a larger population. The notion of replicating the evaluation before going to market is highly relevant in social science. Over the past decade, the ‘replication crisis’ in social science has seen a number of high‑profile findings debunked. Single studies with surprising findings that are published in top journals have had too much impact on how we view the world. We need to do a better job of building in replication in social science – just as Phase Two and Phase Three trials do in the clinical world.

Another feature of the health evidence ecosystem that could be adopted in policy is the notion of the living evidence review. A decade ago, Julian Elliott, Professor of Evidence Synthesis at Cochrane Australia, developed the notion of a living evidence review – a systematic review that is updated in real time, as new studies are published. When COVID hit, Elliott worked with clinicians and researchers from around Australia to establish the National COVID‑19 Clinical Evidence Taskforce, which produced a set of living evidence syntheses and recommendations that were updated every week to collate high‑quality evidence on everything from how best to treat new coronavirus variants, to the impact of masks on the risk of transmission (Global Commission on Evidence 2022). In policy, the availability of living evidence syntheses would help decision makers identify the most relevant research, and avoid the danger of being swayed by a single low‑quality study.

Strengthening the national evidence infrastructure, so that all programs and policies are rigorously assessed for their effectiveness and impact, and evaluation evidence is routinely synthesised and made publicly available, will take time. We have taken a critically important first step in establishing the Australian Centre for Evaluation and look forward to working with all of you to see high‑quality evaluation evidence placed at the heart of policy design and decision‑making.


Attia P and Gifford B (2023) Outlive: The Science and Art of Longevity, Harmony, New York.

Biddinger K, Emdin C, Haas M, Wang M, Hindy G, Ellinor P, Kathiresan S, Khera A, Aragam K (2022) ‘Association of Habitual Alcohol Intake With Risk of Cardiovascular Disease’ JAMA Network Open, 5(3), e223849‑e223849.

Ejima K, Li P, Smith Jr DL, Nagy TR, Kadish I, van Groen T, Dawson JA, Yang Y, Patki A, Allison DB (2016) ‘Observational research rigour alone does not justify causal inference’ European Journal of Clinical Investigation, 46(12), 985‑993.

Global Commission on Evidence to Address Societal Challenges (2022) The Evidence Commission report: A wake‑up call and path forward for decisionmakers, evidence intermediaries, and impact‑oriented evidence producers, McMaster Health Forum, Hamilton.

Mukherjee S (2010) The Emperor of All Maladies: A Biography of Cancer, Simon and Schuster, New York.

Zhao J, Stockwell T, Naimi T, Churchill S, Clay J, Sherk A (2023) ‘Association Between Daily Alcohol Intake and Risk of All‑Cause Mortality: A Systematic Review and Meta‑analyses’ JAMA Network Open, 6(3), e236185‑e236185.

[i] In case you’re in any doubt that the observational result was misleading, the researchers paired the free‑eating mice with another group. Each mouse in this second group was fed the amount of food that its paired mouse had chosen to eat the previous day. Among the paired mice, the association between calories and longevity disappeared.