Staff in many agencies have difficulty finding and accessing data. Often this is because agencies do not have a central repository or a ‘single source of truth’ for their data assets. There may be multiple, fragmented catalogues and systems, which do not always interact.
Ensuring data is discoverable has many benefits:
- It can prevent data duplication and reduce costs. Where there is a lack of visibility of data holdings across an agency, data assets can be reproduced multiple times, often at a significant monetary cost or data management burden to the organisation [5]. Duplicated data assets might also be stored in multiple locations across an agency, resulting in increased technical infrastructure costs, data security risks and management overheads.
- It promotes reuse of data for improved outcomes and a reduced unit cost of managing data. The more people who can find out what data is available, the more opportunities there are to use that data in new or different ways. The resources involved in managing the data then support many more outcomes.
- It ensures that constraints on data use are known.
- It manages the risk of ‘loss of corporate knowledge’. To make up for insufficient technology and a lack of discoverability, staff often rely on their own established networks and connections to search for and discover data (e.g. asking a colleague in another team where to find data). Without these relationships, the data may never be discoverable to anyone outside the direct data managers or stewards [6].
- It can make managing data quality easier. Without an understanding of what data an agency holds, ongoing management of data is difficult: data assets become out of date or out of sync, and versions and transformations are not appropriately documented [7]. It is also difficult to justify the resources required to manage data assets if the full extent of an agency’s data holdings is not known.
Data inventories, registers or catalogues make it easier to find, access, trace, standardise, use, analyse, manage, govern and communicate about data.
Agencies that are very mature in their data management capabilities may decide to automate much of their cataloguing and focus on developing the quality of the data they hold.
[5] Data61 Web Geospatial Visualisation Research, 2019
[6] Data61 Web Geospatial Visualisation Research, 2019
[7] Data61 Web Geospatial Visualisation Research, 2019
There are many terms to describe listing and storing data assets – the concepts are often distinct, but they also overlap.
A Data Inventory or Data Asset Register is a list of data assets. An inventory should record basic information about a data asset, such as name, description, licence, owner/custodian and reference period.
A Data Catalogue is a piece of technology that, at the most basic level, enables a user to search for and locate data for their specific needs. It relies on well-described metadata to enable search functionality.
A Data Repository is a data storage location. A data repository can store data according to discipline (e.g. clinical data, ecological data), or it can be a single enterprise storage location for all of an agency’s data.
A data catalogue requires a data inventory to populate it with information. A data inventory or data catalogue may reference a data repository as a storage location for one (or more) data assets described within the inventory/catalogue.
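The relationship between these three concepts can be sketched in code. The following is a minimal Python illustration, not a recommended design: the class names, fields and `repo://` identifiers are hypothetical and are not drawn from any standard or product. The inventory is a plain list of records, the catalogue is a search layer populated from that list, and each record only points to a repository rather than holding the data itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InventoryRecord:
    """One entry in a data inventory: basic metadata about a data asset."""
    name: str
    description: str
    custodian: str
    repository: str  # hypothetical identifier of where the asset is stored


class DataCatalogue:
    """A minimal searchable layer populated from inventory records."""

    def __init__(self, records: list[InventoryRecord]) -> None:
        self._records = list(records)

    def search(self, term: str) -> list[InventoryRecord]:
        """Return records whose name or description contains the term."""
        needle = term.lower()
        return [
            r for r in self._records
            if needle in r.name.lower() or needle in r.description.lower()
        ]


# The inventory is the list; the catalogue is built on top of it.
inventory = [
    InventoryRecord("Rainfall observations", "Daily rainfall by station",
                    "Climate team", "repo://enterprise-store"),
    InventoryRecord("Staff training register", "Completed training courses",
                    "HR team", "repo://corporate-store"),
]
catalogue = DataCatalogue(inventory)
matches = catalogue.search("rainfall")
```

Here `matches` contains only the rainfall record, whose `repository` field tells the user where the asset is stored without granting access to it.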
This document focuses on data inventories as a way agencies can identify and record the data resources they have. This can be built on to create a searchable data catalogue.
Data cannot be effectively managed by agencies unless they know what they have and where it is held. A data asset stocktake or inventory provides a way of maintaining a ‘single source of truth’ regarding the location and status of data assets.
An inventory that uses a recognised standard and is supported by rich metadata establishes a baseline from which data can be managed. Data stewards or data managers can access relevant information about a data asset, such as when it was collected or acquired, its licensing arrangements, data formats, methodology (if data lineage or provenance is captured) and, potentially, when the data asset should be updated, archived or disposed of. It can also improve understanding of the key stakeholders and users involved with a data asset.
A well-managed data inventory supports broader data governance processes including:
- Data access control. A good inventory should display the record for the data asset, but not necessarily provide access to it. The record should outline who can access the data and under what conditions.
- Assigning and managing responsibility for, or ownership of, the data assets so there is clear accountability (e.g. data stewards, business owners, data custodians, etc.).
- Notifications relevant to information management, such as when data assets need reviewing for archiving or disposal purposes. This enables agencies to comply with the Archives Act 1983.
- Maintaining a central list of personal information holdings as required under the Australian Government Agencies Privacy Code (10.5.(b)).
- Tracing data lineage or provenance. This is the ability to record movements, changes and interactions with the data throughout its lifecycle (i.e. as it is acquired, processed, analysed, shared and released).
- Decisions relating to data architecture requirements (e.g. the creation of a searchable data catalogue in future).
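As an illustration of how inventory records might support some of these processes, the sketch below records accountability and access conditions against each asset and flags assets that are due for an archiving or disposal review. It is a minimal Python example under stated assumptions: the `GovernanceRecord` fields, the `due_for_review` helper and the sample assets are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class GovernanceRecord:
    name: str
    data_steward: str       # who is accountable for the asset
    access_conditions: str  # who may access the data, and on what terms
    next_review: date       # when archiving or disposal should be considered


def due_for_review(records: list[GovernanceRecord],
                   as_of: date) -> list[GovernanceRecord]:
    """Return records whose review date has arrived, earliest first."""
    due = [r for r in records if r.next_review <= as_of]
    return sorted(due, key=lambda r: r.next_review)


# Hypothetical sample records.
records = [
    GovernanceRecord("Grant applications 2015", "Grants team",
                     "Program staff only", date(2023, 6, 30)),
    GovernanceRecord("Website analytics", "Communications team",
                     "All staff", date(2030, 1, 1)),
    GovernanceRecord("Client survey 2012", "Research team",
                     "Approved researchers", date(2020, 12, 31)),
]
overdue = due_for_review(records, as_of=date(2024, 1, 1))
```

A coordinator could run a check like this periodically and notify the listed data steward for each asset in `overdue`.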
A data inventory is most useful when it is more than simply a list of data assets. Knowing what assets an agency holds is important, but there are a number of other factors that will make it a more effective management and discovery tool.
Metadata is a set of information (e.g. date created, date updated, source of data, security classification) that describes a dataset.
The metadata attributes, which would form the basis of the Data Inventory, should be the metadata most useful for agency data management and/or for users to identify and request access to data relevant to their work. Reliable access and licensing information can indicate what data can be shared or accessed and by whom, removing the need for staff to make ‘judgement calls’ and minimising the risk of inappropriate sharing, access or use.
The ability to search the list of data holdings and discover data is dependent on the quality and consistency of the metadata. If asset records contain complete and accurate keywords and dates, clear descriptions, accurate titles and are human-readable, searching will be far more effective.
This means a data inventory is only as useful as the metadata that sits behind it. Ideally, metadata should be based on recognised standards. There are two standards set by the National Archives of Australia: the Minimum Metadata Standard [8] and the Australian Government Recordkeeping Metadata Standard [9]. There are several international standards which can also be adopted: the Dublin Core Metadata Initiative [10], DCAT [11] or ISO 19115-1:2014 [12]. Using established metadata standards will enhance the comparability of data inventories or catalogues, within and across agencies.
Examples of metadata attributes for possible inclusion in a data inventory:
- Data asset name
- Description (e.g. purpose, collection method, etc.)
- Topic / key words / themes
- Authority for collecting the data, e.g. the legislation or policy driving its collection
- Business owner / Data steward / Data custodian
- Usage caveats (e.g. software requirements, coverage limitations, etc.)
- Time period covered
- Geographic coverage
- Data dictionary (i.e. definitions of key terms)
- Date created / modified
- Frequency / time series
- Protective marking
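The attributes listed above could be captured as a structured record. The sketch below is one possible shape, written in Python; the field names, types and sample values are assumptions chosen for illustration and do not follow any particular metadata standard. An agency adopting a recognised standard would use that standard's element names instead.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DataAssetRecord:
    """Illustrative inventory record; field names are assumptions."""
    name: str
    description: str
    keywords: list[str]
    collection_authority: str  # legislation or policy driving collection
    data_steward: str
    time_period_covered: str
    geographic_coverage: str
    date_created: date
    date_modified: date
    frequency: str
    protective_marking: str
    usage_caveats: str = ""
    data_dictionary: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialise the record, writing dates in ISO 8601 form."""
        return json.dumps(asdict(self), default=lambda d: d.isoformat(), indent=2)


# Hypothetical sample record.
record = DataAssetRecord(
    name="Example rainfall observations",
    description="Daily rainfall totals collected from agency stations",
    keywords=["rainfall", "climate"],
    collection_authority="Example enabling legislation",
    data_steward="Climate data team",
    time_period_covered="2010 to 2019",
    geographic_coverage="Australia",
    date_created=date(2019, 7, 1),
    date_modified=date(2019, 12, 1),
    frequency="daily",
    protective_marking="OFFICIAL",
    data_dictionary={"station_id": "Identifier of the recording station"},
)
record_json = record.to_json()
```

Serialising records to a plain format such as JSON keeps the inventory portable if it is later loaded into catalogue software.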
A useful data inventory will be well-governed and support other data governance and management processes. Well-governed means that it will have clear guidelines on what level of detail needs to accompany the attributes describing the data asset. This may include:
- Minimum attributes or ‘fields’ that must be completed. There should be neither so many that completing them is burdensome, nor so few that the asset is not adequately described.
- A data dictionary or glossary describing what each attribute means. This will ensure consistency of interpretation of content.
- Controlled vocabularies or ‘pick-lists’ containing agreed terms describing an attribute. This minimises free text, reduces the burden of completing an entry and assists with consistency.
- Guidelines regarding what content is expected in each attribute where free-text is an option. This is essential for describing and understanding the data asset.
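Rules like these can be checked automatically when an entry is created or updated. The following Python sketch shows the idea under stated assumptions: the mandatory fields, the pick-list values and the `validate_entry` function are illustrative only, and each agency would substitute its own agreed fields and terms.

```python
from __future__ import annotations

# Assumed minimum fields that every entry must complete.
MANDATORY_FIELDS = ("name", "description", "data_steward", "protective_marking")

# Assumed controlled vocabularies; the values are illustrative only.
PICK_LISTS = {
    "frequency": {"daily", "weekly", "monthly", "quarterly", "annual", "ad hoc"},
    "protective_marking": {"UNOFFICIAL", "OFFICIAL", "OFFICIAL: Sensitive"},
}


def validate_entry(entry: dict[str, str]) -> list[str]:
    """Return a list of problems found in an entry; empty if it passes."""
    problems = []
    for name in MANDATORY_FIELDS:
        value = entry.get(name)
        if value is None or not value.strip():
            problems.append(f"missing mandatory field: {name}")
    for name, allowed in PICK_LISTS.items():
        value = entry.get(name)
        if value and value not in allowed:
            problems.append(f"{name}: {value!r} is not an agreed term")
    return problems


# A draft entry with an empty description and an unlisted frequency.
draft = {
    "name": "Example client contacts",
    "description": "",
    "data_steward": "Service delivery team",
    "protective_marking": "OFFICIAL",
    "frequency": "fortnightly",
}
problems = validate_entry(draft)
```

For this draft, `problems` reports the empty description and the unlisted frequency term. Free-text guidance cannot be checked this way and still relies on the data dictionary and guidelines.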
A well-governed data inventory will also have a person or team that supports and maintains it on an ongoing basis. Records in a Data Inventory will need to be kept up to date for it to remain useful to the agency. Data stewards should be responsible for updating the inventory as needed, but having a coordinator can ensure this occurs when the task is ‘forgotten’ by stewards. The Data Inventory should also be made available to all staff for ease of maintenance and to realise the benefits of data discoverability.
Ideally, a data inventory is built around a comprehensive ‘stocktake’ of an agency’s data holdings, which could include financial data, corporate data and client data in addition to what is traditionally considered to be the operational data unique to the organisation. Whilst comprehensive coverage is desirable, it may be easier to prioritise efforts on operational data first, and then move on to corporate data. This is a decision for agencies to make based on their resources and context.
For agencies with multiple data inventories it may be necessary to consolidate these if they are legacy systems or contain duplicate records. Where multiple inventories are necessary, ensure the purpose and content of each one is clear. In such cases, agencies may wish to have some form of meta-catalogue, or catalogue of catalogues, to enable easy searching.
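A first pass at consolidation can be as simple as listing which inventories mention each asset. The Python sketch below assumes each inventory can be exported as a list of asset names; the inventory names, asset names and helper functions are hypothetical. Matching on a normalised name is deliberately crude, so any duplicates it reports would still need review by a data steward.

```python
from __future__ import annotations


def normalise(name: str) -> str:
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def consolidate(inventories: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each normalised asset name to the inventories that list it."""
    merged: dict[str, list[str]] = {}
    for source, asset_names in inventories.items():
        for asset_name in asset_names:
            merged.setdefault(normalise(asset_name), []).append(source)
    return merged


# Hypothetical exports from two existing inventories.
inventories = {
    "legacy_register": ["Client Contacts", "Grant Payments"],
    "program_inventory": ["client  contacts", "Site Inspections"],
}
meta_catalogue = consolidate(inventories)
duplicates = {
    name: sources
    for name, sources in meta_catalogue.items()
    if len(sources) > 1
}
```

The resulting `meta_catalogue` acts as a simple catalogue of catalogues, and `duplicates` highlights assets recorded in more than one place.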
Integration with existing processes
Recording data in the inventory should not be a discrete activity; it must integrate with other data management processes. It should also align with an agency’s information and record management policies and strategies, as a data inventory is another form of record and information management.
Undertaking a data inventory and establishing standard cataloguing processes is an area where the Senior Data Leader and a CIO can lead together. The CIO and Senior Data Leader can work to streamline all data/record/information management processes to ease the burden on staff and boost information management capability.
Under the Australian Government Agencies Privacy Code, agencies must keep an inventory (formally described as a register) of personal information holdings. Given much of the personal information agencies hold is in datasets, the data inventory will satisfy this requirement. Alternatively, if no broader data inventory already exists, the register may provide a starting point for one. Teams working on a Data Inventory should investigate, leverage and potentially adapt what already exists in an agency.
As agencies continue to develop their data maturity and capability, they may consider transitioning this inventory into a data catalogue. This opens up further possibilities and benefits, including automating a range of data management and governance processes, improving the discoverability of data, creating externally facing as well as internally facing catalogues and possibly aligning or integrating catalogues across agencies.
There are many off-the-shelf and open source cataloguing options available. It may be hard to know which one will suit your agency as there is no one-size-fits-all option. There are benefits to purchasing a vendor-supported product, but also benefits to using more flexible and cheaper open source software and even benefits to developing an agency-tailored catalogue in-house. These decisions are heavily dependent on an agency’s existing data maturity and IT infrastructure.
When selecting catalogue software specifically, some important considerations for agencies are that the catalogue:
- meets the business need
- can integrate with other business systems and/or data infrastructure
- can integrate with other Enterprise Architecture or ICT registers, such as business registers (e.g. capturing business process and legislation, which can then link to datasets associated with that business process/legislation), application registers (e.g. so that datasets can be linked to the applications that house the datasets) and infrastructure registers (e.g. so that the servers that host applications can be linked),
- can appropriately extract and/or maintain metadata for individual data assets, and
- is intuitive to use: this should be judged not only by technical support staff, but by all users of the catalogue.
It may be helpful for agencies to liaise with other agencies that have developed or purchased a data catalogue to understand the pros and cons of different systems and approaches.
Below are a few articles that can be used as a starting point:
- Choosing a Data Catalog (Eckerson Group)
- Open Government Data Toolkit: Technology Options (World Bank)
- Data Discovery and Catalogues (Bloor Research)
Questions to ask:
- What type of data ‘stocktake’ is most appropriate for our agency? An inventory or a catalogue?
- Who needs to be convinced of the benefits?
- What needs to be included in the data inventory or catalogue for it to be useful?
- Who will be involved in creating the data inventory or catalogue?
- How will the data inventory or catalogue be maintained?