CSIRO’s Data61 and the New South Wales government have teamed up to develop a tool that allows important datasets to be shared with the public while ensuring sensitive personal information remains protected.
The NSW government has been using an early version of the tool since March 2020 to analyse, secure and then release datasets tracking the spread of COVID-19 in the state. The technology is also being used to examine other datasets — such as domestic violence data collected during COVID-19 lockdowns and public transport use — before they are publicly released.
Releasing de-identified data can be risky. According to the CSIRO, the Personal Information Factor (PIF) tool uses a sophisticated data analytics algorithm to assess the risk that sensitive information within a dataset could be re-identified and matched to the person it describes.
This allows for “targeted and effective protection mechanisms to be put in place”, the organisation said in a statement on Thursday.
The tool has been developed in partnership with groups such as the Cyber Security Cooperative Research Centre, and has been tested using datasets from the NSW and Western Australian governments, and the Australian Computer Society.
At a time when public awareness of the need for data privacy has increased, NSW chief data scientist Dr Ian Oppermann said the tool has helped the state government minimise the re-identification risk of datasets of COVID-19 cases before releasing them to the public.
“Given the very strong community interest in growing COVID-19 cases, we needed to release critical and timely information at a fine-grained level detailing when and where COVID-19 cases were identified,” he said.
“This also included information such as the likely cause of infection and, earlier in the pandemic, the age range of people confirmed to be infected. We wanted the data to be as detailed and granular as possible, but we also needed to protect the privacy and identity of the individuals associated with those datasets.”
New ways to de-identify data can better protect personal information, according to Data61’s senior research scientist and project lead researcher Dr Sushmita Ruj.
“Having studied other privacy metrics, the team concluded a one-size-fits-all approach to estimating the re-identification risks of unique applications of data can be significantly improved upon,” she explained.
“The evolving approach to a PIF takes a tailored approach to each dataset by considering various attack scenarios used to [re-identify] information. The tool then assigns a PIF score to each set.”
The CSIRO noted that when a dataset's PIF score exceeds a desired threshold, the program recommends how to design a more secure framework so the dataset can be safely released to the public.
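The article does not describe how the PIF score is calculated, so the score-and-threshold idea can only be illustrated with a stand-in. The Python sketch below uses a much simpler, well-known measure: the chance of singling out a record, taken as one divided by the number of records sharing the same combination of attributes. This is not the PIF algorithm, and the function names, field names, sample records and threshold are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group of records
    that share the same values for the chosen attributes.
    NOTE: a simplified stand-in, not the PIF calculation."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    return max(1 / group_sizes[k] for k in keys)


def needs_more_protection(records, quasi_identifiers, threshold):
    """True when the risk score exceeds the chosen threshold."""
    return reidentification_risk(records, quasi_identifiers) > threshold


# Hypothetical records, invented for illustration only.
records = [
    {"postcode": "2000", "age_range": "30-39"},
    {"postcode": "2000", "age_range": "30-39"},
    {"postcode": "2010", "age_range": "60-69"},
    {"postcode": "2011", "age_range": "60-69"},
]

# Fine-grained release: two people are unique on postcode plus age range.
detailed = reidentification_risk(records, ["postcode", "age_range"])

# Coarser release: dropping postcode leaves every person in a group of two.
coarse = reidentification_risk(records, ["age_range"])
```

With these sample records, the detailed release scores 1.0 and the coarser one 0.5, so a threshold of 0.5 would flag only the detailed version. That is the trade-off Oppermann describes: the more granular the data, the easier it is to single someone out.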
The Australian Computer Society and Oppermann have been working on the problem of de-identifying data for several years, while the CSIRO and the Cyber Security Cooperative Research Centre have been looking at ways to enhance the PIF tool since 2020. The partners expect to make it available for wider public use by June 2022.