Datasets and metadata

Metadata Capture is designed around the concepts of datasets, metadata, distributions, and data lifecycle management. These concepts shape your interactions within Metadata Capture.

What is a dataset?

A dataset is a structured collection of data created for a specific purpose. Datasets represent the actual data holdings within your organisation.

For example:

  • Dataset 1: The dataset, Diabetic Patient Records 2024, contains information about diabetic patients of a hospital such as name, age, gender, and medical history.
  • Dataset 2: The dataset, Air Quality Measurements 2023, contains daily air quality measurements such as PM2.5, NO2, and O3 levels.

Datasets vary in size, format, complexity, and accessibility. For example, some enriched data contained in the Air Quality Measurements 2023 dataset are restricted due to quality-control processes.

What is metadata?

Metadata is information that describes the characteristics of a dataset. While the dataset contains the actual data (such as patient records or air quality measurements), metadata describes the context, content, and accessibility of that data.

Using the same dataset examples above, here's an example of their metadata:

Metadata property   Dataset 1 metadata              Dataset 2 metadata
Title               Diabetic Patient Records 2024   Air Quality Measurements 2023
Publisher           City Hospital                   Environmental Agency
Keywords            diabetes, medical history       air quality
Access rights       Restricted                      Public
File format         CSV                             JSON

Metadata simplified: Think of a library
  • The dataset is the book in the library.
  • The metadata is the library catalog card that tells you the author, title, and location of the book in the library.
  • The distribution is the format—physical book, ePub, audiobook—for the same edition.
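
To make the split between data and metadata concrete, here is a minimal Python sketch that holds the two example metadata sets as plain records and searches them without touching any actual data. The class name, field names, and helper functions are illustrative assumptions, not part of Metadata Capture.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetMetadata:
    """Describes a dataset; holds no patient rows or measurements."""
    title: str
    publisher: str
    keywords: Tuple[str, ...]
    access_rights: str
    file_format: str


# The two example records from the metadata table on this page.
CATALOGUE = [
    DatasetMetadata(
        title="Diabetic Patient Records 2024",
        publisher="City Hospital",
        keywords=("diabetes", "medical history"),
        access_rights="Restricted",
        file_format="CSV",
    ),
    DatasetMetadata(
        title="Air Quality Measurements 2023",
        publisher="Environmental Agency",
        keywords=("air quality",),
        access_rights="Public",
        file_format="JSON",
    ),
]


def find_by_keyword(records: List[DatasetMetadata], keyword: str) -> List[str]:
    """Return the titles of records tagged with the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [r.title for r in records if wanted in (k.lower() for k in r.keywords)]


def publicly_accessible(records: List[DatasetMetadata]) -> List[str]:
    """Return the titles of records whose access rights are 'Public'."""
    return [r.title for r in records if r.access_rights == "Public"]


print(find_by_keyword(CATALOGUE, "diabetes"))
print(publicly_accessible(CATALOGUE))
```

A data consumer can discover a dataset and check whether it is accessible using only these records, which is the role the catalog card plays in the library analogy.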

What are dataset records?

In Metadata Capture, you create dataset records, that is, metadata structured according to the DCAT-AP-LU metadata model, an international standard for structuring dataset metadata.

When you create a dataset record, you fill out the properties defined in the DCAT-AP-LU metadata model, such as title, description, publisher, keywords, and access rights. The dataset records (metadata) make datasets discoverable, understandable, and reusable. They enable data consumers to find the data they need, understand its context and limitations, and determine if it fits their intended use.

Know your terms

In Metadata Capture, dataset records are synonymous with metadata records. When you create a dataset record, you document the metadata that describes a dataset, not the data itself. The actual data is not stored in Metadata Capture but is linked through distributions that point to the data's access methods.
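
As a rough illustration of a record that describes data without containing it, the sketch below writes a dataset record as a Python dictionary using property names from the W3C DCAT vocabulary, on which DCAT-AP profiles build. It is a simplified sketch under assumptions: the URL is hypothetical, the publisher and access rights are plain strings rather than full resources, and the exact properties and constraints of DCAT-AP-LU may differ.

```python
import json

# A simplified dataset record. Keys are DCAT / Dublin Core property names;
# values are reduced to plain strings for readability (an assumption).
dataset_record = {
    "@type": "dcat:Dataset",
    "dct:title": "Air Quality Measurements 2023",
    "dct:publisher": "Environmental Agency",
    "dcat:keyword": ["air quality"],
    "dct:accessRights": "Public",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dct:format": "JSON",
            # Hypothetical URL: the record only points to where the data lives.
            "dcat:accessURL": "https://data.example.org/air-quality-2023",
        }
    ],
}


def access_urls(record: dict) -> list:
    """Collect the access URLs of every distribution linked from the record."""
    return [d["dcat:accessURL"] for d in record.get("dcat:distribution", [])]


print(json.dumps(dataset_record, indent=2))
print(access_urls(dataset_record))
```

The record carries no measurements at all. A data consumer follows the access URL of a distribution to reach the data itself.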

Why metadata matters

Metadata is essential for effective data management, sharing, and reuse across organisations and sectors.

Organisations in all domains—health, environment, mobility, finance, both public and private—depend on accurate and comprehensive metadata for research, policy making, and operational needs. For example, if a health organisation needs to access specific health data to control disease outbreaks, they can only find and use that data if it's properly documented. The metadata tells them the structure of the data, its quality, and its limitations.

For organisations to share and reuse data among themselves, they need clarity about the data's context, format, quality, and accessibility. Structured metadata provides this clarity by documenting datasets so others can find them, understand them, evaluate them, and integrate them into their work.

Metadata is a fundamental building block to enable reuse, support data sharing, and strengthen evidence-based decision making and policy development.

Data holders and data consumers

As you work with datasets, you'll interact with different participants in the data ecosystem:

Data holders are organisations or individuals who have the legal right to provide access to datasets and decide how they can be shared. They are responsible for safeguarding the data, following legal requirements, and allowing others to use them when appropriate. In Metadata Capture, data holders typically have roles such as Editors, Validators, and Approvers who manage the dataset record lifecycle from creation to publication.

Data consumers (or data users) are organisations or individuals who lawfully receive data and are allowed to use it for specific purposes. Data consumers may analyse, process, or apply the data, but they do not control who else can access it. They rely on metadata to discover and understand available datasets.

What is a distribution?

A distribution is a specific representation (format) of a dataset that provides access to the data.

The same dataset can have multiple distributions that offer data in different ways, such as:

  • Different formats: CSV, JSON, PDF, etc.
  • Varying levels of granularity: Aggregated statistics vs. raw data
  • Different access methods: Direct download, API, etc.

For example, the Air Quality Measurements 2023 dataset has two distributions:

  • Distribution 1: Raw Daily Measurements. This distribution contains the raw data of air quality measurements collected at each monitoring station. It includes the station ID, station name, measurement date, and pollutant concentrations (PM2.5, NO2, O3 levels). This distribution is available in CSV format and can be used for detailed analysis, research, and modelling.

  • Distribution 2: Aggregated Annual Statistics. This distribution contains aggregated data derived from the raw measurements. It includes the reference year, station ID, and summary indicators such as yearly mean concentrations for PM2.5, NO2, and O3. These indicators support national reporting, long-term trend analysis, and compliance assessments. This distribution is available in PDF format.

The aggregated distribution is derived from the raw distribution. It may cover only a subset of the variables in the raw data, but it takes into account all the values recorded for each variable it keeps. The overall trends, or "shape", of the data are therefore the same in both distributions. Although they differ in format and granularity, both distributions belong to the same dataset.
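
The relationship between the two distributions can be sketched in Python as follows. The sample rows and their values are invented purely for illustration, and the function and field names are assumptions. The sketch keeps the station ID and the three pollutants, drops the station name, and averages every daily value per station and year.

```python
from collections import defaultdict
from statistics import mean

# Invented raw rows: (station_id, station_name, measurement_date, pm25, no2, o3).
RAW_DAILY = [
    ("ST01", "Central", "2023-01-01", 10.0, 20.0, 40.0),
    ("ST01", "Central", "2023-01-02", 14.0, 24.0, 44.0),
    ("ST02", "Harbour", "2023-01-01", 8.0, 18.0, 50.0),
    ("ST02", "Harbour", "2023-01-02", 12.0, 22.0, 54.0),
]


def annual_means(rows):
    """Aggregate raw daily rows into yearly mean concentrations per station."""
    grouped = defaultdict(list)
    for station_id, _station_name, date, pm25, no2, o3 in rows:
        year = int(date[:4])  # reference year taken from the ISO date string
        grouped[(year, station_id)].append((pm25, no2, o3))

    summary = {}
    for key, values in grouped.items():
        pm25_values, no2_values, o3_values = zip(*values)
        summary[key] = {
            "pm25": mean(pm25_values),
            "no2": mean(no2_values),
            "o3": mean(o3_values),
        }
    return summary


for (year, station_id), indicators in sorted(annual_means(RAW_DAILY).items()):
    print(year, station_id, indicators)
```

The aggregated result has fewer variables and fewer rows than the raw data, yet every raw value contributes to it. This is why both distributions describe the same dataset.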

Documenting distributions

It is important to describe, through metadata, the format and level of granularity of each distribution so that data consumers can choose the one that best fits their needs. Different data consumers have different needs. For example:

  • A policy maker may only need aggregated annual statistics.
  • A researcher conducting detailed analysis may need the raw daily measurements, because aggregated statistics may not satisfy the specific requirements of a thesis or research objective.
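
The choice described above can be pictured as a simple lookup over documented distributions. This is only a sketch: the `granularity` key and the `pick_distribution` helper are hypothetical names chosen for illustration, not fields or functions of Metadata Capture.

```python
# The two example distributions of Air Quality Measurements 2023,
# described with illustrative (hypothetical) keys.
DISTRIBUTIONS = [
    {"title": "Raw Daily Measurements", "format": "CSV", "granularity": "raw"},
    {"title": "Aggregated Annual Statistics", "format": "PDF", "granularity": "aggregated"},
]


def pick_distribution(distributions, granularity):
    """Return the first distribution documented with the requested granularity."""
    for distribution in distributions:
        if distribution["granularity"] == granularity:
            return distribution
    return None  # nothing documented at that level


policy_maker_choice = pick_distribution(DISTRIBUTIONS, "aggregated")
researcher_choice = pick_distribution(DISTRIBUTIONS, "raw")
print(policy_maker_choice["title"])
print(researcher_choice["title"])
```

The lookup only works when each distribution's format and granularity are documented, which is the point of describing them in the dataset record.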

Learn more about adding distributions to your dataset record: Add distributions.