Best-practice methods for stripping personally identifiable information (PII) out of statistical data are well established and informed by a large body of research going back decades. Government agencies need to start following them.
One of the world’s foremost experts in the field of statistical de-identification, Dr Khaled El-Emam, says “the adoption of good practices is not very high” among organisations that release data publicly, or to a select few for secondary uses that require de-identification in the absence of consent.
The problem is low adoption of the best methods: de-identification is widely seen as a simple, basic, routine process. El-Emam says practices need to move “into the real world” through stricter application of the existing standards, along with training, certification and, if necessary, the use of specialised software.
Those standards are not hard to find around the world. The Office of the Australian Information Commissioner has plenty of resources on the topic, which start from basic advice and include all the necessary signposts to the most detailed technical aspects of the craft.
“Over the last three or four years, there’s been a lot of activity in developing standards and guidelines from government organisations, from professional organisations, from academic institutions, [and] from regulators around the world on how to de-identify data using a risk-based approach,” El-Emam told the GovInnovate conference this week.
“And all of these guidelines are consistent with each other … there is no excuse.”
De-identification: it’s all about numbers
Best practice de-identification is quantitative. “You can measure the risk of re-identification and once you’re able to measure it, you can manage it,” says El-Emam.
The methods he offered to public servants in Canberra involve long-established risk thresholds, expressed as the probability that the data will be re-identified.
“So there are different organisations around the world that have defined and used certain thresholds for releasing data,” he says. “We have very strong precedents going back a long time, many decades, for what’s acceptable risk. So this covers the full range, from public data for a very risk-averse organisation to non-public data releases.”
A commonly used threshold for public data releases is a re-identification probability of 0.09, and “there are good methodologies for choosing a value that fits within these thresholds,” according to El-Emam.
He says in this case the dataset could have a maximum of 8 quasi-identifiers — information that is not quite PII but can still help identify a person from their record — and contain no longitudinal data. El-Emam later said longitudinal data could be released publicly but in his experience, the amount he would modify the data with “perturbation” techniques would make it fairly useless.
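To make the idea of measuring risk against a threshold concrete, here is a minimal sketch in Python. It assumes one common way of estimating re-identification risk: records that share the same combination of quasi-identifier values form a group, and each record in a group of size k is treated as re-identifiable with probability 1/k. The article does not say which measure El-Emam uses, so the function name, field names and sample records below are hypothetical and for illustration only; only the 0.09 figure comes from the text above.

```python
from collections import Counter

# Threshold for public release quoted in the article.
PUBLIC_RELEASE_THRESHOLD = 0.09


def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk) for a list of dict records.

    Assumption: a record's risk is 1 / (number of records sharing the
    same quasi-identifier values). This is an illustrative measure,
    not necessarily the one used by any particular agency.
    """
    # Build one key per record from its quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    # Count how many records share each key.
    class_sizes = Counter(keys)
    risks = [1 / class_sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)


# Hypothetical sample data: one group of 12 records and one group of 3.
records = (
    [{"age_band": "30-39", "postcode": "2600", "sex": "F"}] * 12
    + [{"age_band": "40-49", "postcode": "2601", "sex": "M"}] * 3
)

max_risk, avg_risk = reidentification_risk(
    records, ["age_band", "postcode", "sex"]
)
meets_threshold = max_risk <= PUBLIC_RELEASE_THRESHOLD

print(f"max risk: {max_risk:.3f}, average risk: {avg_risk:.3f}")
print("meets public release threshold:", meets_threshold)
```

In this sample the group of three records pushes the maximum risk well above 0.09, so the data would need further generalisation or suppression before it met the threshold under this assumed measure.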
Another option is to commission a “motivated intruder test” to get empirical information about how easy it is to re-identify people from the data.
“And the second key thing is that it’s context-based. So you can de-identify a dataset that you post online on the web for anyone to download, versus a dataset that you make available through a … secure enclave; you will de-identify these two versions of the same dataset very differently.”
And thresholds can of course be set anywhere between the two extremes of secure enclaves and open data websites. The same dataset can also be released in multiple forms, with more privacy and information security controls for the richer, riskier versions.
For open public release, the only option to reduce the privacy risk is the application of cryptographic methods of data perturbation, which also significantly reduce the utility of the dataset, El-Emam explains.
That limited usefulness raises the question of how far governments should push the privacy envelope with open data, and helps explain why some sceptics have already decided the whole idea is playing with fire for little actual benefit.
Continue reading at The Mandarin: Sceptics respond: ‘It’s a myth that we have an algorithm that works’