Open Data Discovery Specification: A Universal Standard for Metadata Collection

German Osin
Open Data Discovery
May 4, 2022 · 8 min read


Data is the lifeblood of AI. It is data that most significantly contributes to the quality of solutions powered by machine learning.

A critical asset for any enterprise, data can be extremely challenging to take advantage of since:

  1. The volume of data in different forms and formats is growing rapidly;
  2. Existing data solutions are numerous, yet they do not entirely cover the needs of IT organizations and data teams;
  3. Data pipelines and data infrastructures are becoming too complex to handle effectively.

Data has become too complicated, and it takes more and more time for professionals to collect and manage it efficiently and at scale.

This article provides an overview of the Open Data Discovery Specification and explains how it addresses the challenges of data collection and data management.

Open Data Discovery Specification 101

Open Data Discovery Specification (ODD Spec) is an open-source, industry-wide metadata standard that enables engineers to collect and export metadata from cloud-native applications, infrastructures, and other data sources.

ODD Spec defines a schema for collecting metadata and specifies the semantics of data discovery. Data source-agnostic by design, it can be integrated with various data tools through endpoints to receive their metadata.
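
To make this more concrete, below is a minimal sketch of what a metadata payload in the spirit of ODD Spec might look like, written as a Python dictionary. The field names and the ODDRN identifier format shown here are approximate; specification.md on GitHub remains the authoritative source.

    # A hedged, illustrative sketch of a metadata payload in the spirit of ODD Spec.
    # Field names and the ODDRN format are approximate; see specification.md on
    # GitHub for the authoritative schema.
    data_entity_list = {
        # The data source that owns the entities listed below
        "data_source_oddrn": "//postgresql/host/analytics-db/databases/sales",
        "items": [
            {
                # A globally unique resource name for the entity
                "oddrn": "//postgresql/host/analytics-db/databases/sales/tables/orders",
                "name": "orders",
                "type": "TABLE",
                # Free-form, source-specific metadata attached to the entity
                "metadata": [{"metadata": {"owner": "data-eng", "row_count": 120000}}],
            }
        ],
    }

Because the format is data source-agnostic, the same envelope can carry tables, data pipelines, dashboards, or ML models; only the entity type changes.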

Image by the author

ODD Spec allows for building data catalogs with features such as data federation, proper end-to-end lineage, data quality assurance, company-wide observability, and discoverable ML assets.

The Problem of Data and Metadata Discovery

As the amounts of structured, semi-structured, and unstructured data continue to grow, organizations face challenges that never existed before. Today, any professional who deals with data has to answer the following questions:

  • As an ML Engineer or Data Analyst, how do I find data that fits my use case when it is scattered across multiple systems and owners? How do I reach the responsible data owners when I find data issues or have questions about a source?
  • As a Data Analyst, how can I improve my data profiling processes and define sensitive data more efficiently?
  • As a Data Steward, how do I merge data from different data catalogs that use proprietary non-interchangeable metadata formats? How do I build federated systems if existing data catalogs do not support them?
  • As a Data Engineer, how do I make sure that my deprecation strategy will not break downstream data entities?
  • As a Data Steward, how do I expand the functionality of existing data catalogs beyond static data assets like tables and schemas?
  • As a Data Engineer or ML Engineer, how do I tackle building and documenting ML models and data pipelines in locked-in data catalogs?

All of these questions bring us back to data discovery. Not only does it consume over 30% of data teams’ time, but it also creates a data access problem. For teams building AI solutions, data access is the major barrier to implementing AI & ML at scale within their organizations.

Some open-source initiatives are already trying to address these challenges.

Open-source data catalogs like Amundsen, DataHub, and Marquez are designed to reduce the time spent on data discovery. However, the data discovery process in these catalogs is pre-defined for the user, and the possibility of re-using discovered metadata in other data discovery products is often very limited.

Marquez has introduced the OpenLineage specification to standardize data lineage discovery. However, the specification does not cover critical entities that live outside data lakes and warehouses, such as Dashboards, Pipelines, and ML Models, and it does not allow for metadata enrichment with quality tests, test results, and data profiling.

The Open Data Discovery Specification was developed to close the gaps that other tools and services do not cover, or cover only partially. The idea behind ODD Spec is simple:

ODD Spec provides the means and capabilities to unify metadata formats, enabling a more efficient, transparent, and consistent exchange of metadata from various sources, between multiple parties who need to discover data.

Innovating the Data Discovery Space with ODD Spec

We believe that Data and ML Engineers should not have to spend 90% of their time ensuring that data is clean and reliable. Tasks such as fine-tuning, debugging, and maintaining data pipelines, as well as cataloging and curating datasets, should not create data silos that prevent engineers from working on ML models, business-critical analytics, and other high-value work.

An open-source, open-standard ecosystem for collecting and managing data can dramatically reduce the costs of building and maintaining data products for enterprises of all sizes.

The same applies to metadata. It should be easy to query and access through a standard open API. Any data source, from DWHs and data pipelines to ML model repositories, should support an open API so that it is self-discoverable in any tool and any cloud.

Building on this line of thought, we have designed the Open Data Discovery Specification with the following core features:

  1. A standard Open Data Discovery API (ODD API) that democratizes data through an open and transparent exchange of metadata between different systems: data sources, data pipelines, ML models, and both existing and next-gen data catalogs;
  2. Flexible data discovery models designed to scale with a rapidly evolving data landscape and to capture data entities at the moment of their creation;
  3. A federated ‘Catalog of Catalogs’ that unlocks organization-wide data discovery;
  4. An open-source reference implementation based on the ODD API specification;
  5. A community-driven structure to achieve better compatibility with a wide variety of third-party integrations and data tools.

We envision the data discovery process before vs. after ODD Spec as follows:

Image by author

In the current data discovery ecosystem, various data sources (e.g. feature stores, ETL tools, ML pipelines, data warehouses, and data quality tools) and data catalogs have to exchange data directly. Open Data Discovery Spec offers a standardized ODD Adapter that enables data exchange in a unified format.

Image by the author

As can be seen from the image above, the data discovery process based on ODD Spec involves pull, push, and federation strategies. In this ecosystem, any data source and data catalog can expose an ODD Adapter API or use adapter microservices for data discovery. A push strategy can be applied to combine data with already discovered data entities. Data catalogs based on ODD Spec operate only on metadata and, by design, never consume real data. This approach ensures data compliance across all levels of the organization.

With this in mind, it is clear that ODD Spec has the potential to benefit various groups of users, from data catalogs to enterprises. Let’s take a closer look.

Benefits of ODD Spec

The major benefit for data catalogs and data assets is realized through better integration with other tools and services. ODD Spec enables them to provide better, more easily accessible products to their customers, resulting in wider adoption on the market.

In the case of enterprise clients, major perks come with faster, more efficient evaluation of data, rapid access to trusted data sources, and more convenient exchange of metadata between teams, which leads to faster time-to-market for their products — be it advanced analytics dashboards, ML models feeding from various verified data sources, or real-world AI/ML-powered solutions.

Data Discovery Models in ODD Spec

Push and Pull

Metadata discovery with ODD Spec is similar to the collection of metrics, logs and traces, which can be done by using pull or push models. Because every model has its own range of applications, ODD supports both models to cover major use cases.

Pulling metadata directly from its source is the most straightforward method of metadata collection. However, it can be challenging to develop and maintain a centralized fleet of domain-specific crawlers. When metadata is pulled from multiple sources with no standard guiding the process, a separate source-specific crawler must be created for each data source, which is a complex and inefficient approach. ODD resolves this issue with the ODD Adapter, a universal API for metadata collection and processing.

ODD Adapter is a lightweight service that is integrated into a data warehouse as a proxy layer for metadata collection in a standardized format. The adapter receives requests for data entities and returns those entities in response. The adapters are designed to be source-specific and to expose only the information that can be pulled from a particular data source.
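
For illustration, here is a minimal sketch of what such an adapter endpoint could look like, using FastAPI. The /entities path and the changed_since parameter are assumptions made for this example, not the normative adapter contract.

    # A minimal, hypothetical sketch of an ODD Adapter endpoint using FastAPI.
    # The /entities path and changed_since parameter are assumptions for this
    # example; consult the adapter contract in the spec for the actual interface.
    from datetime import datetime
    from typing import Optional

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/entities")
    def get_entities(changed_since: Optional[datetime] = None) -> dict:
        # A real adapter would query the data source's own catalog
        # (e.g. information_schema in a warehouse) and map the result
        # to ODD data entities in the standardized format.
        return {
            "data_source_oddrn": "//postgresql/host/analytics-db/databases/sales",
            "items": [
                {
                    "oddrn": "//postgresql/host/analytics-db/databases/sales/tables/orders",
                    "name": "orders",
                    "type": "TABLE",
                }
            ],
        }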

Image by author

The pull model is recommended when the latency of index updates is acceptable and an adapter is already integrated into the system.
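
On the catalog side, the pull flow is little more than polling that endpoint on a schedule. Below is a hedged sketch that reuses the hypothetical /entities route from the adapter example above.

    # A hedged sketch of the pull model: the catalog polls an adapter endpoint
    # and ingests whatever entities changed since the last run. The URL and
    # parameters are assumptions for this example.
    import requests

    ADAPTER_URL = "http://odd-adapter.internal:8080/entities"  # assumed deployment URL

    def pull_entities(changed_since: str) -> list[dict]:
        response = requests.get(ADAPTER_URL, params={"changed_since": changed_since}, timeout=30)
        response.raise_for_status()
        return response.json()["items"]

    entities = pull_entities("2022-05-01T00:00:00Z")
    print(f"Pulled {len(entities)} data entities from the adapter")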

Image by author

In the push model, individual metadata providers push information to the central repository via APIs. This model is preferable for use cases like Airflow job runs and data quality check runs.
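
As a sketch of what a push might look like in practice, the snippet below sends the same kind of entity list to a central ODD server. The /ingestion/entities path and the bearer-token authentication are assumptions made for this example; the platform documentation defines the actual ingestion API.

    # A hedged sketch of the push model: a metadata provider (e.g. an Airflow
    # callback or a data quality job) pushes entities to the central ODD server.
    # The endpoint path and auth scheme are assumptions for this example, and the
    # entity fields follow the illustrative shape used earlier in this article.
    import requests

    ODD_PLATFORM_URL = "http://odd-platform.internal:8080"  # assumed deployment URL
    COLLECTOR_TOKEN = "replace-with-your-token"

    payload = {
        "data_source_oddrn": "//airflow/host/airflow.internal",
        "items": [
            {
                "oddrn": "//airflow/host/airflow.internal/dags/sales_etl/runs/2022-05-04",
                "name": "sales_etl run 2022-05-04",
                "type": "JOB_RUN",
            }
        ],
    }

    response = requests.post(
        f"{ODD_PLATFORM_URL}/ingestion/entities",
        json=payload,
        headers={"Authorization": f"Bearer {COLLECTOR_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()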

Data Discovery Federation

Open Data Discovery comes with a built-in data federation feature, allowing multiple databases to function as one. It also allows data entities to be efficiently scraped from other ODD servers.

Open Data Discovery covers two major use cases for data federation:

  1. Building scalable data catalogs
  2. Pulling data entities from one ODD server to another

It also employs two types of data federation.

#1 Hierarchical Federation

Hierarchical federation allows ODD servers to scale to environments with dozens of data centers and millions of nodes. The topology of hierarchical federation resembles a tree, where higher-level ODD servers collect data entities from a larger number of subordinate servers.

A hierarchical setup may include many ODD servers per data center, each collecting data in high detail (instance-level drill-down), and a set of global ODD servers that collect and store data from these local servers.

#2 Cross-Service Federation

In the case of cross-service federation, an ODD server of one service is configured to scrape selected data from another service’s ODD server, to enable queries against both datasets within a single server.
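
A purely hypothetical sketch of that scraping step is shown below; the endpoint path and the namespace filter are invented for illustration and are not defined by the spec.

    # A purely hypothetical sketch of cross-service federation: one ODD server
    # scrapes selected entities from another ODD server so that both datasets
    # can be queried in one place. The endpoint and filter are invented here.
    import requests

    REMOTE_ODD_SERVER = "http://odd.payments-team.internal:8080"  # assumed URL

    def scrape_remote_entities(namespace: str) -> list[dict]:
        response = requests.get(
            f"{REMOTE_ODD_SERVER}/entities",
            params={"namespace": namespace},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["items"]

    # Only the selected slice of the remote catalog is federated locally.
    payments_entities = scrape_remote_entities("payments")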

Conclusion

Open Data Discovery Specification is an open-source, industry-wide standard for metadata, designed to establish rules for how metadata should be collected, processed, and managed in an automated manner.

This article provides a brief overview of ODD Spec, covering the challenge, the solution, its major features and benefits, and the specifics of the data discovery process.

To learn more about Open Data Discovery Specification and our vision for data model specification, please check out specification.md on GitHub. There you will find the code and examples you can use to explore the spec in more detail.

If you have any feedback or suggestions, feel free to reach out to me in the comments section, or contact me directly on LinkedIn. Thanks!

P.S. I have recently shared my thoughts on data discovery and observability for ML solutions in an interview for TheSequence. Please check it out to get a bigger picture of ODD Spec.
