De-identification is seen as simple: ‘what’s the worst that can happen?’

By Stephen Easton

Friday November 18, 2016


Even following accepted international best practice for data de-identification won’t satisfy some privacy advocates and cybersecurity experts, who worry that all open data is already an unacceptable threat to individual citizens. But the big data train has already left the station. Taking de-identification more seriously and following the guidelines will be increasingly important as it picks up steam.

Dr Khaled El-Emam, whose anonymisation methods satisfy the most stringent rules for the release of healthcare information in the United States and Canada, does not believe the risk of identification outweighs the potential benefits of the big data era.

“The key thing about re-identification attacks is … almost all of them occurred in situations where data was not de-identified properly,” El-Emam told the GovInnovate conference this week.

“All known re-identification attacks were done by academics and the media,” he added — though open data critics would emphasise “known” in this sentence.

Australian government agencies at all levels have routinely de-identified statistical data for years, to varying standards, but they might need to look again at their process.

Information and Privacy Commissioner Timothy Pilgrim says risk-based de-identification of public open data is a bit like sending people to the moon. It’s a simple enough concept to understand, but actually quite complicated in practice, and there’s a chance of fiery explosions if mistakes are made.

And while academics from the field like El-Emam can point to reams of academic literature backing up risk-based processes, the commissioner disagrees that there is a strong consensus in the community on what “getting it right” means.

Both agree that de-identification is rapidly becoming more challenging. The probability of datasets being re-identified is generally growing alongside their complexity. And the common practice of linking datasets together adds new risks to previously de-identified sets.

What’s the worst that has happened?

Two notorious examples from the US served to show the consequences of poorly de-identified personal records being linked back to individuals and combined with information from other public sources, including news articles.

In one case, details of a year’s worth of New York City taxi rides and all taxi driver license numbers were released in response to a freedom of information request and quickly re-identified.

“The medallion number is essentially a unique identifier for the taxi driver, and it was pseudonymised using an MD5 hash without a salt — so it was a very simple protection scheme that was poorly done,” El-Emam explains.
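To sketch why an unsalted hash over a small identifier space offers almost no protection, the snippet below pre-computes MD5 digests for every candidate medallion and inverts a released pseudonym by simple lookup. The medallion pattern used here is a simplified assumption for illustration, not the full set of real NYC formats.

```python
import hashlib
import string
from itertools import product

# Assumed toy pattern: one digit, one letter, two digits (e.g. "5X41").
# Real NYC medallions use a handful of similarly short patterns, which is
# the whole problem: the keyspace is tiny.
def candidate_medallions():
    digits, letters = string.digits, string.ascii_uppercase
    for d1, letter, d2, d3 in product(digits, letters, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"

# With no salt, every possible plaintext can be hashed once up front;
# the "pseudonyms" in the released data then fall to a dictionary lookup.
table = {hashlib.md5(m.encode()).hexdigest(): m
         for m in candidate_medallions()}

pseudonym = hashlib.md5(b"5X41").hexdigest()  # as it appeared in the release
print(table[pseudonym])                       # recovers "5X41"
```

A per-record salt (or, better, a keyed hash) would have defeated this table, which is exactly the "poorly done" part El-Emam points to.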

The identity of all the drivers and trips they had taken over a year made it fairly easy to work out plenty of other details and cross-reference with other facts in the public domain. For high-profile passengers, that’s a lot of information.

“So they were able to track the movements of actors and actresses and politicians and so on by looking at pictures of them entering cabs,” El-Emam explained. “That didn’t reveal anything interesting, except that actors really tip very poorly.”

A few politicians were also caught red-faced leaving various houses of ill repute at unseemly hours of the morning.

One poor decision can risk the policy evidence-base

In another famous re-identification attack, a researcher showed Washington State’s hospital discharge data — which was routinely shared for a fee along with that of other states — was also very poorly de-identified.

“She looked at newspaper articles about vehicle accidents and domestic disturbances and then matched the information in the stories with the hospital discharge database,” he said, showing an example in which the news story contained a staggering 13 points of personal information about the crash victim.

“And, you know, if I know 13 things about you, you’re toast.”
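The matching attack described above amounts to a simple join between publicly reported facts and the released records. All field names and records below are invented for illustration; the real discharge data carried far more quasi-identifiers than this.

```python
# A minimal sketch of a linkage attack: each hospital record is "de-identified"
# (no names), but still carries quasi-identifiers that a news story about a
# named victim can match against.
hospital_discharges = [
    {"age": 34, "zip": "98101", "month": "2011-06",
     "cause": "motor vehicle crash", "diagnosis": "femur fracture"},
    {"age": 34, "zip": "98101", "month": "2011-06",
     "cause": "fall", "diagnosis": "concussion"},
    {"age": 61, "zip": "98052", "month": "2011-06",
     "cause": "motor vehicle crash", "diagnosis": "rib fracture"},
]

# Facts recoverable from a single news story about a named crash victim.
news_story = {"age": 34, "zip": "98101", "month": "2011-06",
              "cause": "motor vehicle crash"}

matches = [r for r in hospital_discharges
           if all(r[k] == v for k, v in news_story.items())]

# With enough attributes the match is unique, tying a name to a diagnosis.
print(len(matches), matches[0]["diagnosis"])  # 1 femur fracture
```

Four attributes suffice in this toy set; with 13, as in the case El-Emam describes, uniqueness is all but guaranteed.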

The consequences were more serious this time, in El-Emam’s view.

“So what happened after this of course is … many of the states in the US stopped sharing the hospital discharge data, which had a huge negative impact on public health research in the US.

“When states like California, which had the largest database, stopped sharing their data after this, it had a huge negative impact and negative consequences.”

‘It’s a myth that we have an algorithm that works’

Cryptographer Vanessa Teague — who recently helped put the danger of re-identification on the radar of Canberra’s bureaucracy by uncovering Medicare service provider ID codes from a release — joined the panel as one such sceptic.

She challenged El-Emam to offer one cryptographic algorithm that he could say definitely works to protect publicly released data “without completely destroying” the information.

“I don’t see a method that’s been peer-reviewed and extensively scrutinised that gives us confidence that we could put out Census data or medical data or Centrelink data for total public release, without completely destroying the information content of that dataset,” she said. “And, be confident that individuals couldn’t be re-identified based on their information.”

“So I’m going to put it out there … it’s a myth that we have an algorithm that works. I’m willing to be convinced by evidence, but I haven’t seen the evidence at this point.”

El-Emam took up the challenge, insisting that the “large body of work in this area” over 40 years had built up knowledge of what does and does not work. Asked how he would secure the Medicare data Teague re-identified recently, for re-release, he referred her to the many de-identification guidelines and launched into a brief summary of the probability-based risk thresholds and assessment processes.

“So you choose a value, set a threshold, measure your risk, demonstrate your risk is below the threshold — and there [are] dozens of algorithms that you can use to modify your data to ensure that your measured risk is below the threshold,” El-Emam said.

Teague challenged him to name “one algorithm that works” and El-Emam put forward “optimal lattice anonymisation” — adding there were several more of the same structure and “tons” that allow “k-anonymity” to be used with public data releases.
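The threshold-based workflow El-Emam outlines can be sketched in a few lines: group records by their quasi-identifier values, take the worst-case re-identification probability as one over the smallest group size, and check it against the chosen threshold. This is a toy illustration of measuring k-anonymity, with invented records and field values.

```python
from collections import Counter

# Each tuple is a record's quasi-identifiers: (age band, sex, postcode).
# Records sharing the same combination form an "equivalence class".
records = [
    ("30-39", "F", "2600"), ("30-39", "F", "2600"), ("30-39", "F", "2600"),
    ("40-49", "M", "2600"), ("40-49", "M", "2600"), ("40-49", "M", "2600"),
    ("50-59", "F", "3000"), ("50-59", "F", "3000"), ("50-59", "F", "3000"),
]

def max_reid_risk(rows):
    # Worst-case probability of picking out an individual is 1 over the
    # size of the smallest equivalence class.
    class_sizes = Counter(rows)
    return 1 / min(class_sizes.values())

threshold = 0.34              # i.e. require every class to have k >= 3
risk = max_reid_risk(records)
print(risk <= threshold)      # True: this toy release is 3-anonymous
```

The "dozens of algorithms" El-Emam mentions (generalisation, suppression, lattice search and so on) are different strategies for modifying data until this measured risk falls below the threshold while keeping as much detail as possible.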

Making sense of such arguments is a struggle for the layperson, and on top of that, there is a strong sense that academics from these overlapping fields, public servants, the public and other stakeholders are not on the same page.

Adversarial attack-based credibility

“I come from a cryptography background,” said Teague, introducing herself and explaining that some cryptographers create algorithms while others try to break them, including their own.

“This is how we decide that things are secure,” she said. “So the whole community runs by this very open, actually quite friendly process, in which people put their algorithms up for public scrutiny, and others try and explain what’s wrong with them and poke holes in them, figure out how to reverse their encryption … or forge the signatures or whatever.

“This is the way we figure out what’s secure.”

RSA encryption has stood for over 30 years but there is no “formal mathematical proof” of its security, only the fact that it has been “totally open” to anyone to have a crack.

Development of the recent Advanced Encryption Standard was also a “totally open process” in which “every cryptographer in the world had a go at breaking them before they were standardised”.

Efforts were then even more focused on cracking the winning algorithm and “we can be fairly confident now that the good guys are ahead of the bad guys” according to Teague. But she had harsher words for El-Emam’s field.

“When I look at the de-identification space, mostly as an outsider, I don’t see that kind of mature, free and open, scientific and mathematical discussion,” said Teague.

“I don’t see public, peer-reviewed algorithms with a clear explanation of why they work on particular datasets. What I see is some people from the crypto community breaking some things, thinking this is fun, and this is part of the way that we search for truth and we prove things.”

She said people in her field were shocked to find their helpful hacking — which follows standard academic guidelines and ethics of its own — was often unwelcome.

“To me, that’s exactly the way it’s supposed to be, because that’s the way that we figure out how to improve algorithms, by understanding and acknowledging what falls apart.”

“So I’m very sceptical that we even have an algorithm that can allow us, safely, to de-identify for public release, the kind of sensitive datasets that I see the Australian government publishing, or intending to publish in the future.”

Risk versus rewards

Stephen Hardy, Data61’s technology director for data analytics, walks more of a middle line, arguing de-identification should not be seen as “absolute” as the risk of re-identification depends on the capabilities of the attacker, and how much other identifiable information they already hold.

“If you’re releasing publicly then you have to try and think of all the possible attacks — obviously it’s impossible but you do as much as possible — and there are certainly lots of methods that are safe, and have been shown to be safe, for public data release,” he says.

“I wouldn’t claim otherwise. It’s more, as we push this boundary to try and release more data, because it’s useful … we’re trying to push more information out and that information is more re-identifiable by definition.

“Therefore you would do well to consider who is it OK to be able to re-identify this data. If an organisation does re-identify the data, are they just learning something they already know?”

If the prize for a potential adversary is worth them investing a lot in re-identifying the data, the risk is higher.

Hardy also acknowledged that at the heart of the disagreement about whether de-identification methods are reliable is a clash of two very different academic cultures.

“I actually read both literatures,” he later broke in to El-Emam and Teague’s exchange. “The crypto community has a much stronger adversarial focus.”

“On the flipside, if you look at the de-identification literature, there’s always this question of what is the person trying to achieve?”

Commissioning a “motivated intruder attack” is as adversarial as the OAIC advice on “assessing the risks of re-identification” gets. It also suggests finding someone with similar expertise to El-Emam:

“Depending on the outcome of the risk analysis and the de-identification process, information and data custodians may need to engage an expert to undertake a statistical or scientific assessment of the information asset to ensure the risk of re-identification is low.”

Public servants ‘lulled into a false sense of security’

The Australian Bureau of Statistics general manager of strategy and partnerships, Gemma Van Halderan, said she felt people were not “all talking about the same thing” when they used words like “secrecy, anonymisation, pseudonymisation” which had specific legal meanings in the public service.

But another speaker on the panel, Salinger Privacy director Anna Johnston, has little confidence that public service leaders — or government lawyers — understand what they are doing when they accept assurances that open data is safe from IT experts.

“There’s a risk of legal non-compliance because the lawyer is not talking the same language as the engineer or the IT person when they talk about de-identification,” she said, giving two examples.

Johnston said data regularly goes into the New South Wales Data Analytics Centre containing “hundreds of thousands of records” with only the names stripped out, leaving other details such as home addresses, that could enable easy re-identification.

She sees “a risk that agencies hand over this data, thinking it’s been de-identified” to the point of legal compliance but actually end up breaching the law when it is later re-identified.

“I’ve also seen legal advice given to government agencies from big-name law firms, saying that replacing names with statistical linkage keys like SLK581 … renders the information de-identified to the point that it no longer meets the definition of personal information and therefore all the privacy rules don’t apply.”

Johnston disagrees strongly with that advice and thinks public servants are being “lulled into a false sense of security” by those who fail to see the problem that the growing risk of re-identification represents.
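To illustrate why, here is a sketch of how an SLK-581 key is constructed (letter positions per the AIHW specification; padding conventions simplified here). Because the key is a deterministic function of name, date of birth and sex, anyone who already knows those details about a person can recompute their key and re-link the supposedly de-identified records.

```python
# SLK-581: letters 2, 3 and 5 of the family name, letters 2 and 3 of the
# given name, date of birth as DDMMYYYY, and a one-digit sex code.
# Names too short to supply a letter are padded with "2" (simplified).
def slk581(family: str, given: str, dob_ddmmyyyy: str, sex_code: str) -> str:
    def pick(name, positions):
        letters = [c for c in name.upper() if c.isalpha()]
        return "".join(letters[p - 1] if p <= len(letters) else "2"
                       for p in positions)
    return pick(family, [2, 3, 5]) + pick(given, [2, 3]) + dob_ddmmyyyy + sex_code

# Illustrative person only: an attacker who knows name, DOB and sex can
# regenerate the exact same 14-character key and match it in the dataset.
print(slk581("Johnston", "Anna", "01011970", "2"))  # OHSNN010119702
```

Nothing secret goes into the key, which is why Johnston argues that replacing names with it does not take records outside the definition of personal information.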

Reader comment:

Publication of de-identified personal information should be considered to be weakly and temporarily protected.

Many examples will eventually be breached and re-identified as machine learning tools and the proliferation of unprotected data sets in the wild erode even the best current de-identification methods, as the recent NICTA/ANU/UNSW research by Rizoiu et al makes clear.

See for example, a machine learning exercise extracting individual traits out of a very effectively de-identified set:

Marian-Andrei Rizoiu et al, ‘Evolution of Privacy Loss in Wikipedia’, Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, February 2016, pp 215-224.

(See also the recent instant breaches of the 10% Medicare sample and the APSC census data sets by Aussie researchers, methods unknown.)
