
LinkML: A Shared Skeleton So Your Data, Your Models and Your Agents Speak the Same Language

🌐 This is an automatic translation of the original post in Spanish. Some nuances may have been lost along the way.

The human musculoskeletal system works because there’s a skeleton underneath. Without it, the muscles would pull in contradictory directions and the whole thing would collapse onto itself. Modern data and AI stacks have exactly that problem: many muscles (Pydantic, JSON Schema, SQL, GraphQL, RDF) pulling with no bone underneath. LinkML proposes that bone.

The underlying problem: four versions of the same data

If you’ve worked on any moderately serious architecture in recent years, this will sound familiar. You have an entity, say Customer, and it ends up living in four different places:

  • A Pydantic class in the backend to validate input.
  • A JSON Schema in your API documentation and in your MCP tool definitions.
  • A SQL DDL in the database.
  • And, if you’re unlucky, a semantic vocabulary (RDF, OWL, JSON-LD) for some consortium or client that demands interoperability.

The four versions start out identical and, by month six, they’ve diverged. Someone renamed a field in Pydantic and forgot to update the DDL. The MCP tool returns JSON that the downstream agent can’t parse because the schema is out of date. It’s pure entropy: the system tends towards disorder unless you invest constant energy keeping it aligned.
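To make the drift concrete, here is a minimal sketch using only the standard library. A dataclass stands in for the Pydantic class, and both the `full_name` rename and the `drift` helper are hypothetical names made up for illustration:

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """Backend representation (a stand-in for the Pydantic class)."""
    id: str
    full_name: str  # renamed from "name" in the backend only
    email: str
    sector: str


# Database representation, still using the old column name.
CUSTOMER_DDL = """
CREATE TABLE customer (
    id TEXT PRIMARY KEY,
    name TEXT,
    email TEXT,
    sector TEXT
);
"""


def ddl_columns(ddl: str) -> set[str]:
    """Extract column names from a simple, single-table CREATE TABLE statement."""
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    return {line.split()[0] for line in body.split(",") if line.strip()}


def drift(model: type, ddl: str) -> dict[str, set[str]]:
    """Report names present on one side but missing on the other."""
    model_fields = {f.name for f in fields(model)}
    columns = ddl_columns(ddl)
    return {
        "only_in_model": model_fields - columns,
        "only_in_ddl": columns - model_fields,
    }


report = drift(Customer, CUSTOMER_DDL)
print(report)
```

A check like this only detects the divergence after the fact. It does nothing to prevent it, which is the gap a single source of truth is meant to close.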

It’s like a body whose bones grow at different rates. It eventually becomes unviable.

What is LinkML, in one sentence?

LinkML (Linked Data Modeling Language) is a declarative, YAML-based language where you define your model once and it compiles to more than 30 formats: Pydantic, JSON Schema, SQL DDL, GraphQL, Protocol Buffers, TypeScript, Java, Rust, OWL, SHACL, JSON-LD, Mermaid diagrams, HTML docs… and the list doesn’t end there.

It’s the shared skeleton. The muscles stay the same (every language, every database, every API), but now they hang off something coherent.

Where this comes from (and why it matters)

LinkML wasn’t born yesterday in a startup with a Y Combinator demo. It comes from Lawrence Berkeley National Laboratory and the Monarch Initiative, a network of federated biomedical projects that have spent years obsessed with a very specific problem: getting a hundred different labs to publish data that can be combined without every combination costing a month of human work.

But, and this is the interesting part, it has outgrown its biomedical origins. Today it’s used by:

  • ENTSO-E (the European electricity grid).
  • NFDI (Germany’s national research infrastructure).
  • NIH Bridge2AI, iSamples, Alliance of Genome Resources and many others.

And the reference publication (Moxon et al., “LinkML: an open data modeling framework”) has just come out in GigaScience (2026). This is not a project in hibernation: at the time of writing, the latest commit was 24 hours old.

Apache-2.0 license on the core, CC0 on the metamodel. All commercially usable without friction. In case you were wondering.

How it works: a single source of truth

You write something like this (YAML, readable even for a well-meaning product manager):

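id: https://example.org/customer-schema  # placeholder identifier for this sketch
name: customer_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types  # built-in types such as string
default_range: string

classes:
  Customer:
    description: A person or entity that buys
    slots:
      - id
      - name
      - email
      - sector

slots:
  id:
    identifier: true
    range: string
  name:
    range: string
  email:
    range: string
    # Single quotes: YAML rejects \S as an escape inside double quotes.
    pattern: '^\S+@\S+\.\S+$'
  sector:
    range: SectorEnum

enums:
  SectorEnum:
    permissible_values:
      energy:
      healthcare:
      industry: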
And from that YAML you generate, with one command, the equivalent versions in every format you need. You change a field in the YAML, recompile, and all the versions stay in sync. One truth, many masks.
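In practice you would use LinkML's own generators for this (the project ships command-line tools such as `gen-pydantic`, `gen-json-schema` and `gen-sqltables`). To show the idea without installing anything, here is a toy sketch that is not LinkML's implementation. It projects the same schema, written as the dict a YAML parser would produce and including a plain-string `name` slot, into two targets. The function names `to_json_schema` and `to_sql_ddl` are hypothetical:

```python
import json

# The Customer schema as a plain dict (what a YAML parser would hand back).
SCHEMA = {
    "classes": {"Customer": {"slots": ["id", "name", "email", "sector"]}},
    "slots": {
        "id": {"identifier": True, "range": "string"},
        "name": {"range": "string"},
        "email": {"range": "string", "pattern": r"^\S+@\S+\.\S+$"},
        "sector": {"range": "SectorEnum"},
    },
    "enums": {
        "SectorEnum": {
            "permissible_values": {"energy": None, "healthcare": None, "industry": None}
        }
    },
}


def to_json_schema(schema: dict, class_name: str) -> dict:
    """Toy projection of one class to a JSON Schema object."""
    properties, required = {}, []
    for slot_name in schema["classes"][class_name]["slots"]:
        slot = schema["slots"][slot_name]
        prop = {"type": "string"}
        if slot["range"] in schema["enums"]:
            prop["enum"] = list(schema["enums"][slot["range"]]["permissible_values"])
        if "pattern" in slot:
            prop["pattern"] = slot["pattern"]
        if slot.get("identifier"):
            required.append(slot_name)
        properties[slot_name] = prop
    return {
        "title": class_name,
        "type": "object",
        "properties": properties,
        "required": required,
    }


def to_sql_ddl(schema: dict, class_name: str) -> str:
    """Toy projection of the same class to a CREATE TABLE statement."""
    columns = []
    for slot_name in schema["classes"][class_name]["slots"]:
        slot = schema["slots"][slot_name]
        suffix = " PRIMARY KEY" if slot.get("identifier") else ""
        columns.append(f"  {slot_name} TEXT{suffix}")
    return f"CREATE TABLE {class_name} (\n" + ",\n".join(columns) + "\n);"


json_schema = to_json_schema(SCHEMA, "Customer")
ddl = to_sql_ddl(SCHEMA, "Customer")
print(json.dumps(json_schema, indent=2))
print(ddl)
```

Both outputs are derived from the same dict, so renaming a slot there renames it everywhere on the next run. The real generators handle far more (inheritance, cardinality, typed ranges), but the mechanism is the same.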

Infographic: one LinkML schema, multiple outputs

The piece that changes the game for those of us working with LLMs: OntoGPT

This is where, for me, it gets really interesting.

OntoGPT (from the Monarch Initiative, BSD-3 licensed, 809 GitHub stars at the time of writing) implements a method called SPIRES (Structured Prompt Interrogation and Recursive Extraction of Semantics). The trick is elegant: you use your LinkML schema as an extraction contract. You give the LLM free text and ask it to extract structured information conforming to that schema. The output is automatically validated against the contract. If the model hallucinates a field that doesn’t exist, it gets discarded.

And best of all: it supports open models via ollama (Qwen, Llama, Mistral). That means the entire pipeline can run on-premise, inside your own DGX Spark or whatever hardware you have, without touching external APIs. If you care about sovereignty over your data, and I care more about it every day, this is relevant.

It’s basically what many teams are reimplementing by hand with Pydantic + Instructor + ad-hoc LangChain parsers. Except here, on top of that, the schema is portable: your API, your database and your docs pipeline can consume it too.
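The "schema as extraction contract" idea can be sketched in a few lines. This is not OntoGPT's code or API. It is a hand-rolled illustration in which `CONTRACT` and `enforce_contract` are hypothetical names, the contract is copied by hand from the Customer schema, and the LLM call is replaced by a hard-coded dict standing in for model output:

```python
import re

# Contract derived by hand from the Customer schema (assumption: all values are strings).
CONTRACT = {
    "id": {},
    "name": {},
    "email": {"pattern": r"^\S+@\S+\.\S+$"},
    "sector": {"enum": {"energy", "healthcare", "industry"}},
}


def enforce_contract(candidate: dict) -> tuple[dict, list[str]]:
    """Keep only values that satisfy the contract; collect a reason for each rejection."""
    accepted: dict = {}
    rejected: list[str] = []
    for key, value in candidate.items():
        rule = CONTRACT.get(key)
        if rule is None:
            rejected.append(f"{key}: not in schema")
        elif "pattern" in rule and not re.search(rule["pattern"], str(value)):
            rejected.append(f"{key}: does not match pattern")
        elif "enum" in rule and value not in rule["enum"]:
            rejected.append(f"{key}: not a permissible value")
        else:
            accepted[key] = value
    return accepted, rejected


# Stand-in for what a model might return after reading free text.
llm_output = {
    "id": "C-001",
    "name": "Acme Grid",
    "email": "ops@acme.example",
    "sector": "banking",      # not one of the permissible values
    "loyalty_tier": "gold",   # field that does not exist in the schema
}

accepted, rejected = enforce_contract(llm_output)
print(accepted)
print(rejected)
```

The point of the design is that the model never gets the last word. Whatever it returns is filtered through the same contract the rest of the stack uses.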

Where it fits in a modern agentic architecture

Think of a typical stack today: LangGraph orchestrating agents, MCP tools for external actions, RAG for context, on-premise models for inference, and a couple of downstream APIs with their own Pydantic models.

Each of those components talks about the same entities —customers, documents, events— but each has its own representation. And every change propagates like a distributed ache across the whole system.

LinkML lets you have a single canonical contract that compiles to:

  • Pydantic for the agents and LangGraph state.
  • JSON Schema for the MCP tool definitions.
  • SQL DDL for persistence.
  • JSON-LD or OWL if at some point you expose the data as linked data (or if a public-sector client demands it for compliance).
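As a rough sketch of how two of those targets can share one contract, assume a JSON Schema for Customer has already been generated (it is hand-written below). It can feed an MCP tool definition, which carries a `name`, a `description` and an `inputSchema`, and it can also give an agent's state its keys. The tool name `upsert_customer` and the `CustomerState` type are hypothetical:

```python
from typing import TypedDict

# Assumed output of a JSON Schema generator for Customer (hand-written here).
CUSTOMER_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "email": {"type": "string", "pattern": r"^\S+@\S+\.\S+$"},
        "sector": {"type": "string", "enum": ["energy", "healthcare", "industry"]},
    },
    "required": ["id"],
}


def mcp_tool(name: str, description: str, input_schema: dict) -> dict:
    """Build a tool definition in the shape MCP uses: name, description, inputSchema."""
    return {"name": name, "description": description, "inputSchema": input_schema}


# Hypothetical tool whose arguments are exactly a Customer.
tool = mcp_tool(
    "upsert_customer",
    "Create or update a customer record.",
    CUSTOMER_JSON_SCHEMA,
)

# Agent state keyed by the same property names (functional TypedDict syntax).
CustomerState = TypedDict(
    "CustomerState", {key: str for key in CUSTOMER_JSON_SCHEMA["properties"]}
)

state: CustomerState = {
    "id": "C-001",
    "name": "Acme Grid",
    "email": "ops@acme.example",
    "sector": "energy",
}
print(tool["inputSchema"]["required"], sorted(CustomerState.__annotations__))
```

In a real setup the Pydantic classes and the JSON Schema would both come out of the LinkML generators rather than being derived from each other. The sketch only shows that the tool contract and the agent state end up with the same field names.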

The body stops fighting itself.

The dystopian angle: the other side

I’m going to be honest about the dark part, because there always is one.

Imagine a future where everything is described by a formal schema. Every human interaction, every decision, every concept. It sounds like a rationalist dream (and it is), but it also sounds like absolute control. If you’re the one defining the skeleton, you organize the body. If a closed consortium defines it, it organizes it for its own interests. Interoperability can be freedom or it can be capture, depending on who holds the pen.

LinkML, being open, Apache-2.0 and with European adoption, is on the bright side today. But the tool is neutral. What gets done with it is not. Worth keeping in mind when we think about data standards for the next decade.

When to use it and when not to

It’s not a silver bullet. My practical heuristic:

Infographic: LinkML or plain Pydantic?

Use LinkML when one or more of these conditions hold:

  • You have three or more consumers of the same model (API, DB, agents, another app).
  • There are explicit FAIR, governance or interoperability requirements.
  • The project has a long horizon (≥18 months).
  • You share the model with an external consortium or a client that demands formal semantics.

Stick with plain Pydantic when:

  • The model is internal, lives in a single service and nobody else consumes it.
  • The project is short and won’t evolve much.
  • Your team has no appetite for learning a new tool right now.

LinkML shines in projects with many mouths eating from the same plate. For one spoon and one plate, it’s still overkill.
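If it helps to pin the heuristic down, it can be written as a small function. This is only a sketch: the thresholds come from the lists above, while `ProjectProfile` and `prefer_linkml` are hypothetical names. The "team appetite" factor is deliberately left out because it is a judgment call, not something to measure:

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    consumers: int                     # API, DB, agents, other apps...
    has_governance_requirements: bool  # FAIR, governance, interoperability
    horizon_months: int
    shared_externally: bool            # consortium or client demanding formal semantics


def prefer_linkml(profile: ProjectProfile) -> bool:
    """Return True when at least one of the 'use LinkML' conditions holds."""
    return any(
        [
            profile.consumers >= 3,
            profile.has_governance_requirements,
            profile.horizon_months >= 18,
            profile.shared_externally,
        ]
    )


internal_service = ProjectProfile(
    consumers=1, has_governance_requirements=False, horizon_months=6, shared_externally=False
)
shared_platform = ProjectProfile(
    consumers=4, has_governance_requirements=True, horizon_months=24, shared_externally=False
)
print(prefer_linkml(internal_service), prefer_linkml(shared_platform))
```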

Honest risks

So I don’t stay in “sales” mode:

  • The ecosystem’s biomedical bias: templates and examples are dominated by biological domains. Generalizing to industrial or business domains requires your own validation.
  • Semantic learning curve: concepts like IRI, JSON-LD context or SHACL can be intimidating. The good news is you can ignore them until you need them. The bad news is the documentation leans heavily on ontology vocabulary.
  • Moderate bus factor: the project lives mainly at LBNL/Monarch. It’s not the Apache Foundation. Worth keeping an internal mirrored fork.
  • OntoGPT with open models: SPIRES is validated against GPT-3/4. Its performance on Qwen or Mistral on-premise needs to be measured. Reasonable assumption: it works, but requires prompt tuning.

My practical recommendation

If you ask me what I’d do over the next two months with this, my answer has three small, cheap steps:

  1. A 1–2 sprint spike: take an internal pipeline that already has duplicated Pydantic + JSON Schema + SQLAlchemy and rewrite that common source in LinkML. Measure real friction.
  2. An OntoGPT + Qwen on-premise pilot: try the SPIRES pattern on a non-biomedical domain, with non-trivial structured extraction. Compare against a manual baseline.
  3. Document a one-page internal usage criterion (“When LinkML, when Pydantic?”) so the team doesn’t have to reinvent the judgment every time.

If the first two steps go well, LinkML stops being an academic curiosity and becomes a cross-cutting architectural piece. If they go so-so, we’ve lost six weeks and learned something. Asymmetric cost, controlled risk.

Closing

I’ve been thinking about the same thing for a few years now: data in organizations has a skeleton problem, not a muscle problem. We buy tools (muscles) constantly, but we still don’t have a common bone to hang them from. Every architecture decision repeats the same cognitive task: “how do I represent this entity?”, “in what format?”, “with what validation?”.

LinkML is not the only possible answer, but it’s one of the most serious I’ve seen. And, above all, it’s one of the few that comes from the world of open research, with a permissive license and European institutional adoption. That, at the moment we’re in, with growing pressure around data sovereignty, is no small detail.

If your organization is building agents, LLM-based structured extraction or data platforms that have to talk to other systems, it’s worth a spike. At the very least, you’ll come out with a new mental map of how to separate a datum’s definition from its many representations.

And that, on its own, already changes quite a few things.


Pablo Formoso
author
