We’re still on our data journey but something’s brewing

Date: 2022-11-15

Data is the lifeblood of the modern economy. It impacts, enables and personalises how we work, play and engage socially. It’s crucial for the operation of the economy and society.

Banks and financial services companies can increasingly be thought of as data and digital services organisations with some bricks and mortar operations. Without trying too hard, that analogy can be applied to all sectors of the economy, including government.

Value comes from creating, using, protecting and sharing data. The “use” of data is a very broad term, incorporating analysis, storage, aggregation, dissemination and deletion.

Yet, amazingly, systematic data sharing is still a struggle. Two of the main reasons for hesitation in sharing data are concerns:

that data isn’t generally ‘fit for purpose’ beyond the initial reason it was collected;

about the lack of control as data and products generated from that data are used and reused.

The real-world lifecycle of data often has many twists and turns, and the ‘hands’ touching the data or data products can involve diverse actors. A data lifecycle may include many connections, multiple regulatory environments, sharing in many forms and different uses for the data once received.

The complexity, and the unknown pathways and consequences, make many data custodians so hesitant to share data that they simply don’t. Not sharing is their one guaranteed point of control.

What we need are frameworks for appropriately and safely handling data throughout its lifecycle, and guidance on how to safely use products (insights, alerts, decisions) created from data.

Data as the new electricity

Many people have tried to find an analogy for data to help us think through what we have, how we can safely use it and what we need to do to harness its power.

Analogies of “data is the new oil/asbestos/water” all have some merit but miss a number of fundamental characteristics of data. A dataset may be relatively benign on its own, but joining it with another dataset may suddenly change that. Data can be used and reused without degrading its quality. Data can be shared infinitely and used differently each time.

My current favourite analogy is to liken data to electricity. It took us more than 100 years to develop ways of safely handling electricity of different voltages and currents, but now electricity is literally everywhere in our lives, from lighting to vehicles, from computers to digital watches. We need to develop safe frameworks to work with the equivalent of 240V data as well as 24,000V data.

We have changed – we’re willing and we’re allowed

I was once fond of saying that the main reasons for not sharing data fell into “unwilling”, “unable” or “not allowed”. We have changed and a large part of that change was driven by the need to respond to COVID with (data-driven) situational awareness and actionable insights. COVID upped the stakes and the perceived value of data and insights.

The Intergovernmental Agreement on data sharing came into effect on July 9, 2021. It commits all jurisdictions to share public sector data as a default position, where it can be done securely, safely, lawfully and ethically. The agreement recognises data as a shared national asset and aims to maximise the value of data to deliver outstanding policies and services for Australians.

We’re now increasingly willing and allowed to share and use data. What remains to be addressed are the repeatable patterns of data use – connecting the ‘principles’ of safe, secure, ethical data sharing and use to the ‘bits’ in a dataset or data product created from it.

NSW has been building out the pieces of the puzzle: the NSW AI Strategy (a framework for data use for AI) released in 2020; the Smart Places initiative (using data to make places ‘smart’), also in 2020; an AI review committee established in 2021 to review real-world projects; the NSW Data Strategy released in 2021; and, most recently, the AI Assurance Framework, released and mandated in March 2022. We’re now hunkering down on developing repeatable patterns of data use.

What is a repeatable pattern and how does it help?

Every dataset is unique, and every product created from that data is unique. But the recipes to share and use data can be repeatable patterns. The main elements are:

determining if we can appropriately access the data;

assessing if the data is fit for the purpose we’re about to use it for;

deciding what guidance or restrictions are required on the further use of the data products created from that use.

The elements that help us work out which repeatable pattern to deploy are largely driven by understanding the data provenance, data quality, level of personal information in the data, the inherent sensitivity of the data itself, and the sensitivity associated with the use of the data products created.

The ability to make those determinations depends heavily on metadata – data about the data – and on data about its lifecycle to date. That metadata is almost always incomplete. But if we had it, working out which repeatable pattern to use would be more straightforward, as would identifying the protections needed to share data (and data products) more widely, for more reasons.
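As a minimal sketch of how that might work, assume the metadata is available as a simple record. Everything in the Python below (the `DatasetMetadata` fields, the three-step `Level` scale and the pattern names) is hypothetical and for illustration only; it is not an existing NSW framework or tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Coarse three-step scale, assumed here purely for illustration."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record covering the factors named above."""
    provenance_known: bool        # do we know where the data came from?
    quality: Level                # quality of the data for the intended use
    personal_information: Level   # level of personal information in the data
    data_sensitivity: Level       # inherent sensitivity of the data itself
    product_sensitivity: Level    # sensitivity of using the resulting products


def choose_pattern(meta: DatasetMetadata) -> str:
    """Map a metadata record to an illustrative sharing pattern (names invented)."""
    if not meta.provenance_known or meta.quality == Level.LOW:
        # Without provenance or adequate quality, fitness for purpose is unknown.
        return "do-not-share-yet"
    # The most sensitive factor drives the level of control required.
    risk = max(meta.personal_information,
               meta.data_sensitivity,
               meta.product_sensitivity)
    if risk == Level.HIGH:
        return "high-control-environment"
    if risk == Level.MEDIUM:
        return "restricted-sharing"
    return "open-release"
```

The point of the sketch is that the decision becomes mechanical once the metadata exists. With incomplete metadata, the first branch is where most datasets would land.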

A real-world example of data use: COVID case reporting

In March 2020, the NSW government committed to releasing information about the developing number of confirmed COVID cases every day at the postcode level.

Concerns related to the level of personal information in the data and its inherent sensitivity. These were balanced against the strong desire of the public to be informed about the developing COVID situation.

A complete set of possible fields for release was collated from NSW Health sources and then tested for the total amount of information that would be revealed about individuals if released (and the individual was identified).

A series of consultations was undertaken on the balance between data released “in the public interest” and data that was merely “of interest to the public”. The risk of individuals being reidentified was also considered, along with how much information could be associated with any individual identified.

A personal information factor (PIF) tool had been developed and tested some time earlier. The PIF tool was used to develop an upper limit measure of the worst-case (greatest amount of) information that would be released if an individual were identified. This tool and measurement process was used to design additional protections (disconnecting features, aggregation, obfuscation) for the data before it was released as open data.

The data in the reduced feature tables are analysed each day to ensure the PIF is reduced to an agreed level before release. The dataset is assumed to be in the form of rows (unique individuals) and columns (features related to those individuals).
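The PIF calculation itself is not described here, so the following is not the PIF tool. It is a simpler stand-in, assuming the rows-and-columns form above: it measures how surprising the rarest combination of feature values is, in bits, and blocks release when that exceeds an agreed limit. The function names, the sample rows and the 1.5-bit limit are all invented for illustration.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Row = Sequence[Hashable]


def worst_case_bits(rows: Sequence[Row]) -> float:
    """Surprisal, in bits, of the rarest combination of feature values."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row) for row in rows)
    rarest = min(counts.values())
    # A combination shared by few rows says more about each person in it.
    return math.log2(len(rows) / rarest)


def drop_column(rows: Sequence[Row], index: int) -> List[Tuple[Hashable, ...]]:
    """Remove one feature (column) from every row."""
    return [tuple(v for i, v in enumerate(row) if i != index) for row in rows]


def safe_to_release(rows: Sequence[Row], max_bits: float) -> bool:
    """True when the worst-case measure is within the agreed limit."""
    return worst_case_bits(rows) <= max_bits


# Made-up sample: (postcode, age band) for four individuals.
cases = [
    ("2000", "30-39"),
    ("2000", "40-49"),
    ("2010", "30-39"),
    ("2010", "40-49"),
]
reduced = drop_column(cases, 1)  # keep postcode only

print(worst_case_bits(cases))         # 2.0 (every row is unique)
print(worst_case_bits(reduced))       # 1.0 (each postcode covers two people)
print(safe_to_release(reduced, 1.5))  # True
```

Dropping a feature is only one of the protections mentioned above; aggregation and obfuscation would be tested against the same kind of limit.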

The data released was also used to create daily updated spatial maps of COVID cases in NSW. The dataset and maps were updated daily for approximately two-and-a-half years. Two data product sets were created:

High control environment: Unit record-level data with personal information and unique rows. This was accessed by data custodians and analysts working under the regulatory environment operational during the COVID health emergency in NSW and under conditions of confidentiality.

No control environment: Raw data with reduced personal information and sensitive information, released to the public.

So, now what?

We need to put serious effort into refining the repeatable patterns of use and building awareness of the fundamental importance of metadata. The good news is that there is a range of emerging international standards (ISO, IEC and JTC1) that are rapidly maturing, and which will help … but only if we’re willing to use them.

Standards and metadata are not everyone’s cup of tea, but there is a lot of serious, quality thinking behind a published standard. And there is a standard for making a cup of tea (ISO 3103).

Dr Ian Oppermann is the NSW Government chief data scientist and industry professor at the University of Technology Sydney.

For more information about the PIF tool, see this NSW Government website case study. For a description of the PIF tool, see this CSIRO/Data61 article.