Metadata & Lineage

The root of understanding

Metadata describes the content and structure of the data that flows through the system. It is the key to building a common understanding of the data from end to end.

It provides a linkage between the structural definitions within technology systems and the points where data originates, both conceptually (as an idea) and physically (from another system or location).

It allows similarities and differences within data to be described in nuanced business terms.
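
To make that concrete, a metadata element can be thought of as a record that ties a technical definition to its business meaning and its origin. The sketch below is a minimal illustration in Python; the FieldMetadata type and all field, system and team names are assumptions invented for the example, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldMetadata:
    """Metadata for a single data element, linking its technical
    definition to its business meaning and its origin.
    (Illustrative structure only; names are assumptions.)"""
    technical_name: str          # name of the column/field in the system
    data_type: str               # structural definition, e.g. "DECIMAL(10,2)"
    business_term: str           # conceptual origin: the idea it represents
    business_description: str    # nuanced business description
    physical_source: Optional[str] = None  # system or location it came from

# Two fields that look identical structurally but differ in business meaning
order_total = FieldMetadata(
    technical_name="ord_tot_amt",
    data_type="DECIMAL(10,2)",
    business_term="Order Total",
    business_description="Total value of the order including tax.",
    physical_source="sales_db.orders",
)
invoice_total = FieldMetadata(
    technical_name="inv_tot_amt",
    data_type="DECIMAL(10,2)",
    business_term="Invoice Total",
    business_description="Amount invoiced; may exclude cancelled lines.",
    physical_source="billing_db.invoices",
)
```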

To extract maximum value from data, all organisations must have a metadata strategy from the point of capture through to the point of exploitation.

Defining connections

Individual metadata elements may exist within business glossaries, data models, technical definitions of systems, reference data catalogues and other dictionaries. 

The relationships between these elements are collectively known as lineage. Lineage tells consumers about the provenance of the data: whether it can be trusted, how it may be used and whether it can be shared further.
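
Lineage is naturally a graph over these metadata elements. The sketch below is a minimal Python illustration of walking such a graph upstream to answer a provenance question; MetadataElement and provenance are names invented for the example, not the API of any catalogue product.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class MetadataElement:
    """A metadata element from a glossary, data model, catalogue, etc."""
    name: str
    source_kind: str                           # e.g. "business glossary"
    upstream: list[MetadataElement] = field(default_factory=list)

def provenance(element: MetadataElement) -> list[str]:
    """Walk the lineage graph upstream: where does this element come from?"""
    trail = []
    stack = list(element.upstream)
    while stack:
        node = stack.pop()
        trail.append(f"{node.name} ({node.source_kind})")
        stack.extend(node.upstream)
    return trail

# A glossary term traced back through an integrated store to its source system
source = MetadataElement("crm.customer.email", "technical definition")
integrated = MetadataElement("warehouse.dim_customer.email", "data model",
                             upstream=[source])
report = MetadataElement("Customer Email", "business glossary",
                         upstream=[integrated])

print(provenance(report))
# ['warehouse.dim_customer.email (data model)',
#  'crm.customer.email (technical definition)']
```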


Maximum clarity, maximum value

Managing metadata is key to building an effective data landscape. It was once difficult and expensive to modify source systems to add new fields. In a Big Data environment, where it is cheap and easy to ingest data from a wide array of sources, that cost has moved from the point of data capture to the point of downstream integration and exploitation. Where there are multiple consumers of the same data, poorly defined upstream metadata multiplies the cost of integration: each consumer repeats similar integration work that could have been done once.

There is an old adage that fixing data at source is 100 times cheaper than cleaning it up downstream. The same surely applies to defining and managing metadata as close to the source as possible. The most effective thing a Data Architect can do to increase overall value is to increase interoperability, and interoperability is completely dependent on accurately defined and documented metadata at the points of capture, integration and exploitation.

Data will flow within and beyond an organisation wherever it is needed; it can be exploited far beyond where it is captured. In these extended data supply chains, who is responsible for defining metadata? This is where it helps to think in terms of decentralised domains and to adopt a product-centric mindset.

Each domain will capture data at source or ingest it from upstream. Upstream documentation and standards will of course help define the target domain, but domain owners should be thinking in terms of delighting their own data consumers, taking responsibility for documenting their metadata in a way that makes use of their data products seamless. If the data that you produce within a domain is to be consumed by others, then you need a metadata and lineage strategy that describes the data and where it comes from. So the first point to define metadata is at the point of capture, where immediate subject matter expertise lies. The next is at the point of integration within the domain, and the final point is where data exits your domain, alongside the data footprint and data products that are exposed to consumers.
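
One lightweight way to express such a strategy is a descriptor that travels with each data product and records metadata at all three points. The sketch below is a plain-Python illustration; the descriptor layout and every dataset, team and field name are assumptions made for the example, not a prescribed standard.

```python
# Hypothetical descriptor published at the domain boundary alongside the
# data product it describes. All names are illustrative assumptions.
customer_orders_product = {
    "name": "customer_orders",
    "owner": "sales-domain-team",
    "description": "Orders enriched with customer reference data.",
    # Point of capture: metadata defined by the subject matter experts
    "sources": [
        {"dataset": "sales_db.orders", "defined_by": "order-capture team"},
        {"dataset": "crm.customers", "defined_by": "crm team"},
    ],
    # Point of integration: how the sources were combined within the domain
    "integration": "orders joined to customers on customer_id; "
                   "one row per order",
    # Point of exit: the footprint exposed to downstream consumers
    "schema": {
        "order_id": "unique order identifier",
        "customer_email": "email of the ordering customer (see CRM glossary)",
        "order_total": "total order value including tax, GBP",
    },
    "sharing": "internal use only; contains personal data",
}
```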

Each domain will therefore require an appropriate metadata strategy that delights its consumers and maximises the value added to the supply chain.