THE PIA REVIEW: PIA ANDREWS This is a Public Sector Pia Review — a series on better public sectors.
I have been working in government data policy and implementation for 10 years. I have run and managed data infrastructure for open data, spatial data, service analytics and sensitive linked administrative data in multiple jurisdictions and have advised and influenced many more data programs and strategies around the world. I have worked with many of the best in this field globally, and I have seen what works and what doesn’t. I have also built a bit of a reputation as an advocate for opening up government data for public access and reuse.
So when I tell you that data sharing is no panacea, it is coming from a well-informed place.
You see, “data sharing” has become a rather conflated discussion where all data is often assumed to be the same and the same solutions or frameworks that suit one need are assumed to suit everything. There is an overarching and growing myth that sharing more data will naturally lead to better outcomes and that we just need to get around these ‘pesky barriers’ to sharing data and things will be great. Although I still believe open data is important for democracy, economy and society, when it comes to sensitive data, especially unit record data with personal information, more sharing is simply not always the answer. It can create an unhealthy, costly and sometimes dangerous distraction from what could really drive better public outcomes. Appropriately managing sensitive data necessarily requires a lot of work, infrastructure, oversight, security, monitoring and cost, when often what is needed may not require sharing actual data.
For many, data sharing equates to “give me a copy of the data”, which means endless copying and pasting between organisations. This creates significant inefficiencies, ineffectiveness, and technical and timeliness challenges, and loses the authoritative source of data for insight validation. When the data being shared has personal information, this also creates a potential risk for citizens, as agencies make decisions based on incomplete historical cross-function data without oversight or validation of insights. It also means many growing stores of cross-functional data that all need to be even more carefully monitored, governed and managed to avoid misuse by internal threats (eg Tinder stalking by staff) or external threats (eg crackers or criminals). Good data architecture keeps data separate but accessible for different use cases such as analysis tools, secure service APIs, or dashboards, but too often we see single-purpose data architecture that neither scales nor maintains single points of data management and governance, even though that is usually what the internal data specialists try to advocate for.
In this article, I talk about different types of data needs with some practical strategies for improving value realised from data without necessarily having to share data. For the purpose of this article, I will treat data in two simplified categories, noting that aggregate information can sometimes fall into both categories, which is a post for another day:
- Non sensitive data (with no personal information) — examples are many, and include budget data, information about and eligibility criteria for government service, locations of public offices, national health or economic statistics, census statistics, lists of any kind (ministers, departments, bookstores), energy ratings of appliances, annual reports, regulatory self-assessment, company data (with the potential exception of sole traders), government procurement data, etc. Generally, this information is appropriate for public access.
- Sensitive data (with personal information) — examples include unit record administrative data that has name, date of birth, address or other personal identifiers, Medicare records, tax returns, raw census data, criminal records, etc. Generally, this information is not in the public domain nor appropriate for public access.
Although researchers generally want access to unit record and often sensitive data, good research projects are subject to appropriate ethical, legal and technical constraints around data access and reuse. Well-governed research data infrastructure provides great benefits but is obviously and necessarily more expensive than just putting something in the cloud to share. Open data is easy and cheap by comparison but still requires some oversight and monitoring to ensure the integrity of non-sensitive public data is not compromised. Too often, people assume the technical costs of data sharing without the necessary costs of oversight, controls and governance proportionate to risk, with the risk profile of sensitive data completely different from non-sensitive data.
So to get cost-effective value realisation from more data, perhaps sensitive data should generally stay with the authoritative source and you could seek or share actionable insights rather than raw data. Perhaps we need to purposefully maintain the idea that agencies which have legislative protections against inappropriate reuse of data, such as the ABS, are the most appropriate linkers and facilitators for broader access to unit record sensitive data, especially for the purpose of research. This would minimise risk but also requires the ABS to provide fast, robust and agile data services to agencies and researchers alike, something entirely possible as the NSW Data Analytics Centre has been great at demonstrating. Individuals in agencies without these legislative limitations can find themselves in the awkward position of a business or management need being prioritised above appropriate limitations of reuse. You should have strict controls on who (and what) can access the data under what conditions, with great oversight and realtime monitoring of usage and threats. I recommend you check out the NSW Data Analytics Centre Security Statement and Data Governance information for a good example of a data program that blends the best of modern, agile and secure data platforms with strong governance and management of sensitive data.
Understanding data user needs
Different data users need quite different things. Data scientists and researchers generally want access to raw unit record data to get the most possible insights from their research. This is quite different from service providers, who usually need simple information, insights, indicators, calculations or verifications of eligibility, and different again from software developers, who need programmatic interfaces like reliable APIs or SQL views. Senior management generally wants reports and dashboards, so automating insights is critical if your time is not to be entirely consumed by manual processing. Consider also the data needs of third parties who build services on and around government, including payroll companies, social service providers and even startups. Of course, there are also those who want data from government to hold it accountable, including journalists, community groups and even citizens, and there are significant efficiencies to be found in proactively publishing popular or regularly requested data.
Given the different user needs for different data, I would like to gently challenge all those who believe data sharing will solve their problems to consider a few questions to design your data architecture, services and programs accordingly:
- Who are your data users? Consider people, organisations and machines as users.
- How would your different users engage with you to get what they need?
- What are some high-value end-to-end data use cases and what data is needed?
- How do you differentiate sensitive and non-sensitive data?
- How easy is it to publish non-sensitive data of public value?
- What support and services are needed for your users to better use data?
- How can you protect your data from internal and external misuse?
- How will you monitor for dark patterns of use?
- What reporting obligations and high resource work can you automate?
When to potentially not share data
Below are three areas where data sharing may not always be the best way to solve your problem:
1) Creating user-centred & integrated service delivery — if you are trying to improve service delivery, you are probably dealing with a combination of information about services, eligibility criteria, calculation information, different and emerging service channels and the many transactional aspects of service delivery (reporting, payments, applying, etc). If you are taking a truly user-centred approach, you’ll likely find the requirements span different legislation and policy frameworks, sometimes across different portfolios. You are probably also trying to provide your services through websites and apps, as well as to helpdesk functions. Here are some specific areas where you could leapfrog data sharing for better government services that both scale and can be integrated.
- Digital forms — it is often assumed that pre-filling forms will solve the client need, although every time you invest in digitising or sharing data to pre-fill a single form, you are missing the opportunity to create responsive government services around the user journey. With regards to data sharing, if you just try to get data sharing in place to pre-fill, you are still undertaking all the effort and risk of storing and processing requirements that might be better to automate. Application forms generally have two types of data – information you actually need (like a person’s name) and conditions of eligibility you need to validate (such as whether they meet a means test or age requirement). In the case of information you truly need, data sharing with consent from the client might provide some efficiencies, but sharing eligibility information that you still then need to validate seems a missed opportunity. Why not build verifiable claims APIs from authoritative sources to securely verify conditions of eligibility, which would both reduce the processing costs and risk for your organisation, and also improve the dignity and experience of the client? For instance, rather than asking people to provide a payslip which then requires the agency to identify the relevant data, process the eligibility and store the payslip securely, what if you asked the client if they approved for you to check with the ATO that they meet the service means test? Or checked with BDM that they meet the age test?
- Eligibility rules — many rules of eligibility or calculation of services, taxation or regulation are based in legislation, and yet machine consumable versions of these rules are rarely available in an authoritative form that can be consumed for service delivery. This means agencies individually create their own rules engines and then try to share data to run through their own rules, with variable results. If the rules were publicly and authoritatively available, the agencies and everyone else could run the same rules over their data without having to share data between different rules engines to get a more-consistent application of the rules across government.
- Measuring change and impact of services — maybe rather than sharing sensitive and normative administrative data about individuals, you could better identify and understand user journeys, areas of need and impact of change through analysis of anonymised service analytics like website logs, transactional logs, helpdesk statistics, etc. Again, design the data need around what you are trying to achieve.
2) Scaling and modernising regulation or compliance — if you are struggling to monitor and audit compliance or regulatory impact, with increasing pressure to prioritise the highest risk areas due to decreasing auditing resources at your disposal, then sharing even more data between regulatory agencies may actually create more work without necessarily scaling your impact to meet the growing problem space. You might be better served to leverage natural allies in compliance, which includes the customers, staff and competitors of the organisations you are regulating. How can you leverage these groups? By publishing regulatory self-assessments in the public. This would create the opportunity for many more eyes (that are naturally motivated to do so!) to identify issues or areas for prioritisation, whilst also nudging those organisations towards greater compliance over time.
3) Better policy outcomes — policy teams are always under pressure and rely heavily on the subject matter expertise of others. Although there are some policy experts who are also data scientists, many policy teams are better served by having access to analysis tools and reliable data insights, trends or metrics. It is therefore worth supporting policy teams to self-serve by providing access to analysis tools that obscure any unit record data while still giving better data insights, and pairing them with data teams to establish persistent policy measures and indicators that could draw from data anywhere in the world without having to simply copy and paste between systems.
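To make the verifiable claims and shared eligibility rules ideas from the first area more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the rule values, the claim names and the `AuthoritativeSource` class are hypothetical and do not describe any real ATO or BDM interface. The point is only that the service agency receives a yes or no answer, never the payslip or the date of birth.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical machine-readable eligibility rules, published once and
# reused by every agency instead of being re-implemented in each rules engine.
RULES = {
    "min_age_years": 65,
    "max_annual_income": 30_000,
}


@dataclass(frozen=True)
class Person:
    """A unit record that stays with the authoritative source."""
    person_id: str
    date_of_birth: date
    annual_income: int


class AuthoritativeSource:
    """Stands in for an agency that holds the data and answers claims about it."""

    def __init__(self, records):
        self._records = {p.person_id: p for p in records}

    def verify(self, person_id, claim, consent, today):
        """Return True or False for a named claim; never return the raw record."""
        if not consent:
            raise PermissionError("client consent is required")
        person = self._records[person_id]
        if claim == "meets_age_test":
            dob = person.date_of_birth
            had_birthday = (today.month, today.day) >= (dob.month, dob.day)
            age = today.year - dob.year - (0 if had_birthday else 1)
            return age >= RULES["min_age_years"]
        if claim == "meets_means_test":
            return person.annual_income <= RULES["max_annual_income"]
        raise ValueError(f"unknown claim: {claim}")


# Illustrative records only; not real data.
source = AuthoritativeSource([
    Person("p1", date(1950, 6, 1), 25_000),
    Person("p2", date(1990, 1, 1), 80_000),
])
```

A service agency would call something like `source.verify("p1", "meets_means_test", consent=True, today=date.today())` and store only the boolean result alongside the record of consent.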
Different types of value realisation from different types of data
It might be worth briefly outlining different sorts of value that can be realised from different data types, and generally where we see the most investment across different jurisdictions. Consider what sort of value you are trying to realise and what sort of data is involved.
What you need to do to do data properly
Everyone talks about the need for data legislation, policies and platforms, and I’ve spoken briefly about the necessity of good governance and data management, so I’ll cover some of the less known but critical enablers for better value realisation from data that also maintains privacy and appropriate reuse that maintains the dignity of the people of Australia.
Support for agencies
In every jurisdiction there are a small number of agencies that are data specialists, for example, the Australian Bureau of Statistics, Geoscience Australia, NSW Transport and Health, Data Analytics Centres, etc. Most departments, however, have limited data expertise, platforms or strategy. There are shining examples of data excellence, but in my experience holistic and strategic approaches to data are limited to about 10% of agencies or less in any one jurisdiction. What this means is that progressing a data agenda requires coordinated, all-of-government support services for agencies, not just policies or legislative directions.
Modular data infrastructure
A lot of data infrastructure in government is built bespoke to a particular use case and in the context of a single agency need. We see a lot of data copied around and dumped into myriad analytics tools. If analysis tools were only talking to data through programmatic interfaces (SQL for unit record access to relational data, JSON for pre-defined queries, WMS/WFS for spatial context, etc), then we wouldn’t see so many data siloes. It would also make it easier to secure and monitor data usage for any unauthorised or inappropriate patterns of access. If data infrastructure was always built to separate the data from the analysis tools, then the data infrastructure could be built to support multiple use cases including multiple analysis tools, aggregation, sharing and publishing.
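As a rough illustration of separating data from the tools that use it, the sketch below puts a small service in front of a database. It uses Python's standard `sqlite3` and `json` modules; the `DataService` class, the `service_visits` table and the query name are all hypothetical. Tools can only run pre-defined queries that return JSON, and every request is recorded so that usage can be reviewed.

```python
import json
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE service_visits (person_id TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO service_visits VALUES (?, ?)",
        [("a1", "NSW"), ("a2", "NSW"), ("a3", "VIC")],
    )
    return conn


class DataService:
    """Single point of access: tools never touch the tables directly."""

    # Hypothetical catalogue of pre-defined, aggregate-only queries.
    PREDEFINED = {
        "visits_by_region": (
            "SELECT region, COUNT(*) AS visits "
            "FROM service_visits GROUP BY region ORDER BY region"
        ),
    }

    def __init__(self, conn):
        self._conn = conn
        self.audit_log = []  # (user, query name, outcome) for later review

    def query(self, user, name):
        """Run a named query for a user and return the rows as a JSON string."""
        if name not in self.PREDEFINED:
            self.audit_log.append((user, name, "denied"))
            raise KeyError(f"no such pre-defined query: {name}")
        self.audit_log.append((user, name, "ok"))
        cursor = self._conn.execute(self.PREDEFINED[name])
        columns = [col[0] for col in cursor.description]
        return json.dumps([dict(zip(columns, row)) for row in cursor.fetchall()])
```

Because a dashboard, an analysis notebook and a publishing pipeline would all call the same `query` method, there is one place to govern and one log to monitor, rather than one copy of the data per tool.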
If government data infrastructure was built like any other national infrastructure, it should enable an open and competitive marketplace of analysis, products and service delivery both domestically and globally. A useful analogy to consider is the example of roads. Roads are not typically built just from one address to another and are certainly not made to only support certain types of vehicles. It would be extremely inefficient if everyone built their own custom roads and then had to build custom vehicles for each type of road. It is more efficient to build common roads to a minimum technical standard that any type of vehicle can use to support both immediate transport needs, but also unknown transport needs into the future. Similarly, we need to build multipurpose data infrastructure to support many types of uses.
If you are not monitoring usage, access and trends, and identifying potential issues with your data, in realtime, then you may have a problem. We have to design data and all other digital infrastructure to assume that machines are end users just as much as humans. A lot of the human-instigated means of monitoring (weekly reports, spot audits, penetration testing, IRAP reviews, etc) will therefore only catch some of the risks, and rarely in time to mitigate a major issue. Monitoring is to 21st century security what locked gates are to property security, and yet many people still assume a “locked gate” security mentality for digital access systems. I know this is changing, but I strongly recommend treating monitoring as almost more important than access controls, because that is where you will detect normal or abnormal usage patterns in realtime in order to mitigate when the usual access controls are compromised.
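One very simple form of realtime monitoring is to flag any user, human or machine, whose access rate within a sliding time window exceeds a threshold. The Python sketch below shows the idea. The `AccessMonitor` class and its thresholds are hypothetical, and a real platform would likely use richer signals than a fixed limit, so treat this as an illustration of the pattern and not a recommended design.

```python
from collections import defaultdict, deque


class AccessMonitor:
    """Flags users whose accesses within a sliding window exceed a limit."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user -> timestamps of recent accesses

    def record(self, user, timestamp):
        """Record one access and return True if this user is now over the limit."""
        events = self._events[user]
        events.append(timestamp)
        # Discard accesses that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_events
```

Each call to `record` returns straight away, so an alert can be raised at the moment of the unusual access, not in the next weekly report.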
Engage openly with experts and communities
Unlike many of my colleagues in government, I have long valued, and at times admired, the work of privacy and data advocates who actively try to reverse engineer the systems and datasets of government. These people and organisations are skilled and often motivated to ensure public good is upheld, and if you engage constructively with them, you will find people and communities who are well placed to help you ensure better security and privacy outcomes. Why fear criticism? I know many public servants are quite understandably scared of the “front-page test” (no one wants to make the front page of the newspaper), but if it were normal practice for public servants to engage publicly on operational work, then such feedback could quickly and effectively help drive better services, policy and programs. There is a whole discussion to have here, but to keep it practical for data professionals working in government: if you are able to engage with privacy advocates, you can actually reduce the risk and potential exposure of yourself, your teams, your department and the government by gaining an independent perspective that will likely come out at some point anyway.
For those interested, I wrote a detailed personal submission to the Productivity Commission in 2016 where I outlined 38 recommendations to improve value realisation from public sector data. Some of my thinking has evolved in the past three years, but there is still a fair amount of potentially useful things to consider therein.