Victorian privacy commissioner Rachel Dixon has commissioned critical advice for public servants from three Melbourne academics who have worked hard to raise awareness of the risk that individuals can be re-identified from open data publications.
Not so long ago, few public servants questioned the received wisdom that large sets of data about individuals were safe to release as long as enough personal information was removed. That has all changed in recent years, thanks largely to the work of Dr Vanessa Teague, Dr Chris Culnane and Dr Benjamin Rubinstein.
Their work led to a ruling against the Department of Health, which agreed to an enforceable undertaking after the Office of the Australian Information Commissioner found earlier this year that it had breached the Privacy Act. The commissioner nonetheless found the publication did not leave any individual citizen's identity "reasonably identifiable" for the purposes of the federal act (which is similar in this regard to state privacy laws).
In his last ruling before retiring, former federal privacy commissioner Timothy Pilgrim decided the re-identification of patient data did not count as a breach because it required significant technical skills to accomplish:
“While there is some risk of re-identification of patients by a sufficiently informed and skilled person, this risk is extremely low. Further, in the event that a possible match between a known person and a patient in the dataset occurs, it would be extremely difficult to confirm whether the match is correct.”
Teague, who didn’t think that ruling went far enough, has previously debated typical risk-based approaches to de-identification with other experts and essentially argued that datasets based on records of individuals — anonymised or not — are best kept in controlled environments, not released out into the wild on a site like data.gov.au.
Writing this month under the letterhead of the Office of the Victorian Information Commissioner, the three academics warn that it's not well-intentioned experts like themselves that organisations need to worry about when they jump aboard the open-data bandwagon.
“Most published re-identifications are performed by journalists or academics,” they note.
“Is this because they are the only people who are doing re-identification, or because they are the kind of people who tend to publish what they learn?”
Organisations hold vast stores of auxiliary information
“Although by definition we won’t hear about the unpublished re-identifications, there are certainly many organisations with vast stores of auxiliary information,” the report warns.
The researchers explain how re-identification works for those who haven’t been keeping up; essentially, the internet is awash with this kind of “auxiliary information” that can be obtained easily and used to cross-reference with anonymised records.
“The database of a bank, health insurer or employer could contain significant auxiliary information that could be of great value in re-identifying a health data set, for example, and those organisations would have significant financial incentive to do so,” warn Teague, Culnane and Rubinstein.
“The auxiliary information available to law-abiding researchers today is the absolute minimum that might be available to a determined attacker, now or in the future.”
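The linkage the researchers describe can be sketched in a few lines of Python. Everything below is invented for illustration (the names, fields and values are assumptions, not taken from the report): an attacker holds publicly available auxiliary facts about known people and cross-references them against a "de-identified" release on shared quasi-identifiers.

```python
# Toy sketch of a linkage attack. All records here are invented.
# The attacker's "auxiliary information" is cross-referenced against a
# de-identified release on quasi-identifiers such as year of birth,
# postcode and an event date.

deidentified_release = [
    {"id": "rec-001", "birth_year": 1957, "postcode": "4000", "event_date": "2011-08-01"},
    {"id": "rec-002", "birth_year": 1984, "postcode": "3000", "event_date": "2015-02-14"},
    {"id": "rec-003", "birth_year": 1957, "postcode": "3053", "event_date": "2011-08-01"},
]

# Facts gleaned from news articles, social media, or an organisation's
# own customer database.
auxiliary_info = [
    {"name": "Person A", "birth_year": 1957, "postcode": "4000", "event_date": "2011-08-01"},
]

def link(aux, release):
    """Return the release records consistent with one person's known facts."""
    return [
        rec for rec in release
        if rec["birth_year"] == aux["birth_year"]
        and rec["postcode"] == aux["postcode"]
        and rec["event_date"] == aux["event_date"]
    ]

for person in auxiliary_info:
    matches = link(person, deidentified_release)
    if len(matches) == 1:
        # A unique match re-identifies the record, and with it every other
        # field in that row: fields that were never public on their own.
        print(person["name"], "->", matches[0]["id"])
```

A few mundane attributes are often enough to single out one record, which is why the authors stress that any organisation holding such auxiliary data has the raw material for re-identification.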
They restate their view that the federal government’s idea of making re-identification a crime — even if done by experts with the intention of highlighting risks — is likely to be “ineffective, and even counterproductive” based on quite simple logic:
“If re-identification is not possible then it doesn’t need to be prohibited; if re-identification is straightforward then governments (and the people whose data was published) need to find out.”
The report goes on to offer technical guidance for public service leaders from the perspective of people who know how to re-identify data. In the Medicare data they examined, for example, they could tell that Kevin Rudd's record was not among the 10% sample of records: an article online stated the exact date of his aortic valve replacement in Brisbane, and no record in the sample matched it.
“However, if Mr Rudd’s record had been chosen (and there was a 10% chance it would have been), the re-identification of that record would have implied the retrieval of the entire 30 years’ worth of his medical billing history – information that is not otherwise available online,” they explain.
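The check behind that example can be sketched simply. The records and dates below are invented; the point is only that testing whether a publicly reported event appears in a release already leaks information, and that a unique match would expose the entire record attached to it.

```python
# Toy sketch (invented data): a single publicly known fact, such as the
# exact date of a reported surgery, is enough to test whether a person's
# record was included in a released sample.

sample = [
    {"patient": "p1", "item": "knee reconstruction", "date": "2011-08-01"},
    {"patient": "p2", "item": "cataract surgery", "date": "2013-06-20"},
]

known_date = "2012-03-09"              # date published in a news article (invented)
known_item = "aortic valve replacement"

matches = [
    rec for rec in sample
    if rec["date"] == known_date and rec["item"] == known_item
]

# No match: the person's record was not in the sample. Exactly one match
# would re-identify the record, exposing every other entry in that
# patient's billing history.
print("record present in sample:", len(matches) == 1)  # here: False
```

Either outcome is informative, which is part of the authors' argument that release-and-hope is not a neutral act.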
They also attempt to explain why public servants might hear conflicting advice on the matter emanating from different academic disciplines.
“Scientists who regard de-identification as working well tend to come from statistical or medical disciplines; the charge that it doesn’t work tends to come from computer scientists with training in more adversarial notions of information security, cryptography, and privacy.
“The notion of risk in these two communities is very different – one is statistical and random, while the other envisages a determined and ingenious adversary with full access to a wide collection of auxiliary information.”
Coming from the second of the two communities, the authors warn that the risk can’t be eliminated.
“For detailed unit-record level data, there is no de-identification method that ‘works’ in the sense that it preserves the scientific value of the data while preventing re-identification by a motivated attacker with auxiliary information.
“Evaluating the re-identification risk of publicly releasing unit-record-level data is impossible, since it requires knowledge of all contexts in which the data could be read.”
In the end, to preserve the utility of data built from individual records about citizens, the safest course is not to release it to all and sundry as open data, but to restrict it to trusted users in the various ways the report discusses.
“Wherever unit level data – containing data related to individuals – is used for analysis, OVIC’s view is that this is most appropriately performed in a controlled environment by data scientists,” Rachel Dixon explains in her foreword.
“Releasing the data publicly in the hope that ‘de-identification’ provides protection from a privacy breach is, as this paper demonstrates, a risky enterprise.”