Ontologies for Autonomous Scientific Discovery

The foundations behind billion-dollar bets on AI scientists.

TL;DR → Ontologies may become the key that lets autonomous research agents discover new materials and drugs.

— Based on my talk at the Semantic Materials Workshop 2025 in Cambridge, UK.
— Also published on Substack.

From Manual Extraction to Machine Prediction

Could ontologies and semantically aligned databases be the key to the agentic AI scientists of the future? Will the large effort undertaken by ontologists in the past pay off by enabling automated, AI-driven scientific discovery?

Let’s look at how the data-driven, AI-based scientific discovery process works, using materials science as an example. As recently as 2021, before the big emergence of LLMs, this is how we did it.

Let’s say you want to predict the properties of a new material. First you need to know the properties of existing materials, or at least of similar materials that have already been characterized. Then, based on how similar your new material is to those already assessed, you can predict its properties. So, to predict the value of any physical or chemical property of a material, it helps to start with a database of properties of known materials.

What do you do if you don’t have that database? You start with raw scientific articles. You take hundreds, thousands, or even millions of published articles and read them all (with the help of a machine, obviously) to create your database. This is what we did in 2021. We developed a dedicated software package for the task and extracted thousands of properties for thousands of different compounds, in particular properties of crystalline materials. More precisely, we extracted small graphlets of interconnected properties, which was a novelty at the time.¹
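To make the extraction step concrete, here is a minimal sketch of turning sentences into structured property records. It is not the API of the software package mentioned above. The compound lexicon, the regular expression, the `PropertyRecord` type, and the example sentences are all assumptions made up for illustration, and a real pipeline would need proper chemical named-entity recognition and parsing.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon; a real pipeline would use chemical named-entity recognition.
COMPOUNDS = ("polystyrene", "PMMA", "polycarbonate")
PROPERTY = "glass transition temperature"
# A number followed by a temperature unit (kelvin or degrees Celsius).
VALUE_UNIT = re.compile(r"(-?\d+(?:\.\d+)?)\s*(K|°C)(?![A-Za-z])")


@dataclass(frozen=True)
class PropertyRecord:
    compound: str
    property_name: str
    value_kelvin: float


def extract_records(text: str) -> list[PropertyRecord]:
    """Extract (compound, property, value) records, one sentence at a time."""
    records = []
    # Naive sentence split: a period followed by whitespace.
    for sentence in re.split(r"(?<=\.)\s+", text):
        lowered = sentence.lower()
        if PROPERTY not in lowered:
            continue
        compound = next((c for c in COMPOUNDS if c.lower() in lowered), None)
        match = VALUE_UNIT.search(sentence)
        if compound is None or match is None:
            continue
        value = float(match.group(1))
        if match.group(2) == "°C":
            value += 273.15  # normalize every value to kelvin
        records.append(PropertyRecord(compound, PROPERTY, value))
    return records


# Illustrative sentences written for this example, not quotes from any paper.
text = (
    "The glass transition temperature of polystyrene was measured as 373 K. "
    "PMMA shows a glass transition temperature of 105 °C. "
    "Samples were annealed for 2 h."
)
records = extract_records(text)
for record in records:
    print(record)
```

Normalizing units at extraction time is a deliberate choice in this sketch: values from different papers only become comparable once they share a unit.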

For every compound, we would extract all the properties simultaneously, with a precision of 92 % for some extractions, and this was in the pre-LLM era. This was a proof of concept. The real power of the approach showed with materials for which no open databases were widely available. Amorphous materials were a good candidate, so we extracted thousands of glass transition temperatures (a property of amorphous materials) from the raw scientific literature. Armed with a large dataset, we were able to predict glass transition temperatures for entirely unknown compounds using traditional machine learning methods.

This example demonstrates the lifecycle of data-driven materials discovery, minus experimental validation and synthesis. Many of the latest cutting-edge companies go a step further and synthesize the compounds with promising predictions, using robots set up in autonomous laboratories. The result is the complete end-to-end scientific discovery pipeline: from unstructured data, to structured data, to predictions, to synthesis and, finally, experimental validation (and patenting).
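The idea of predicting a property from similar, already characterized materials can be sketched with a k-nearest-neighbours regressor. This is only one possible "traditional" method, chosen here for illustration; the text does not say which models were actually used. The two-number descriptors and the target values below are synthetic placeholders, not measured data.

```python
import math


def knn_predict(train, query, k=2):
    """Predict a property as the inverse-distance-weighted mean of the
    k most similar known materials (Euclidean distance on descriptors)."""
    nearest = sorted(
        (math.dist(features, query), target) for features, target in train
    )[:k]
    # An exact match (distance 0) simply returns its known value.
    for distance, target in nearest:
        if distance == 0.0:
            return target
    weights = [1.0 / distance for distance, _ in nearest]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Synthetic training set: (descriptor vector, property value). All numbers are made up.
train = [
    ((0.0, 0.0), 300.0),
    ((1.0, 0.0), 400.0),
    ((5.0, 5.0), 600.0),
]

# The query sits halfway between the first two materials.
print(knn_predict(train, (0.5, 0.0)))  # -> 350.0
```

In practice the hard part is the descriptor vector: the prediction is only as good as the notion of similarity it encodes.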

The Role of Ontologies and Semantics

How far can the end-to-end scientific discovery workflow take us? Where would it fail?

In the example above, we were extracting data and populating mini ontologies. What we wanted to extract was fairly simple. However, once you start extracting millions of different types of entities, the meaning of what you are extracting starts to blur. For example, a jaguar could be a car or an animal. Telling the difference from context sounds simple, but as you work with more entities and larger datasets, and without an ontology describing the semantics of your data, concepts get duplicated and meaning gets lost. How would you make ML-based predictions without clean labels on the data? Without clarity about what you are actually predicting, and what those predictions could, in principle, be based on?

An ontology defines the semantics of entities and their relationships in a rigorous way. There are many ontologies out there, some domain-specific, others general. If you want to make meaningful predictions on your structured dataset, it helps if the dataset is structured according to an ontology, that is, semantically aligned. The better the alignment, the cleaner the labels, and the more meaningful your predictions.
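The jaguar example can be sketched in a few lines. The mini ontology below is hypothetical: the `ex:` identifiers, the class hierarchy, and the `PROPERTY_DOMAIN` table are invented for illustration and do not come from any published ontology. The point is only that a label is ambiguous, while an identifier plus a class hierarchy is not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    iri: str     # unique identifier; the label alone is ambiguous
    label: str
    parent: str  # identifier of the superclass


# Hypothetical mini ontology with invented "ex:" identifiers.
ONTOLOGY = {
    c.iri: c
    for c in (
        Concept("ex:Animal", "animal", "ex:Thing"),
        Concept("ex:Car", "car", "ex:Thing"),
        Concept("ex:JaguarAnimal", "jaguar", "ex:Animal"),
        Concept("ex:JaguarCar", "jaguar", "ex:Car"),
    )
}

# Which class each property applies to (its domain). Also hypothetical.
PROPERTY_DOMAIN = {"habitat": "ex:Animal", "top_speed": "ex:Car"}


def candidates(label):
    """All concepts that share a surface label."""
    return sorted(c.iri for c in ONTOLOGY.values() if c.label == label.lower())


def is_a(iri, ancestor):
    """Walk up the class hierarchy to check whether iri is a kind of ancestor."""
    while iri in ONTOLOGY:
        if iri == ancestor:
            return True
        iri = ONTOLOGY[iri].parent
    return iri == ancestor


def resolve(label, expected_class):
    """Pick the single concept with this label that fits the expected class."""
    matches = [iri for iri in candidates(label) if is_a(iri, expected_class)]
    if len(matches) != 1:
        raise ValueError(f"cannot resolve {label!r} as {expected_class}: {matches}")
    return matches[0]


print(candidates("jaguar"))
print(resolve("jaguar", PROPERTY_DOMAIN["habitat"]))
print(resolve("jaguar", PROPERTY_DOMAIN["top_speed"]))
```

Here the property being extracted supplies the context: a habitat can only belong to an animal, so the ontology settles which jaguar is meant and the resulting label is clean.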

From Graphs to Reasoning

Once your data is stored in a clean, semantically aligned graph, you can choose what to do with it. On the one hand, you can attach it to an autonomous LLM-based AI agent in a GraphRAG approach (or similar), where the agent essentially queries the graph to retrieve information. On the other hand, you can apply graph-based algorithms directly to the dataset (graph machine learning). For example, you can mathematically encode relationships between entities (geometric representation learning). You can predict node types, links between entities, relationship types, clusters, and much more, all of which have clear use cases in scientific discovery.
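As a minimal sketch of link prediction, the snippet below ranks pairs of unconnected nodes by the Jaccard coefficient of their neighbourhoods. This is a classic heuristic baseline rather than a learned graph model, and the toy graph with nodes `a` to `e` is hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def rank_missing_links(edges):
    """Score every unconnected node pair by neighbourhood overlap (Jaccard)."""
    adjacency = build_adjacency(edges)
    scores = []
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = adjacency[u] & adjacency[v]
        union = adjacency[u] | adjacency[v]
        scores.append((len(shared) / len(union), u, v))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-s[0], s[1], s[2]))


# Hypothetical toy graph; nodes could stand for materials, properties, or methods.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

ranked = rank_missing_links(edges)
for score, u, v in ranked:
    print(f"{u} - {v}: {score:.2f}")
```

Nodes `a` and `d` share two of their three distinct neighbours, so that pair ranks first. On a semantically aligned graph, such a candidate reads as a hypothesis (for example, that a material might exhibit a property) that can then be checked.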

Emerging Frameworks: HoneyComb, SciAgents, Kosmos

Returning to the field of automated materials discovery, we see that it is developing very fast. Two recent approaches are particularly interesting: HoneyComb² (A Flexible LLM-Based Agent System for Materials Science) and SciAgents³ (Automating Scientific Discovery Through Multi-Agent Intelligent Graph Reasoning), the latter aimed primarily at bio-inspired materials. In SciAgents, the automated workflow is conceptually similar to the manual workflow I described above. Ghafarollahi et al. start with unstructured data (papers). They then create a global knowledge graph, which they connect to a series of AI agents that make the scientific discovery.

Figure 2. The HoneyComb system for materials science. Agentic tools are combined with materials science knowledge bases. Figure copied from Zhang et al. https://arxiv.org/abs/2409.00135

Figure 3. The SciAgents system. A series of agents works iteratively to make a scientific discovery. The starting knowledge graph is obtained separately from primary scientific literature. Figure copied from Ghafarollahi et al. https://arxiv.org/abs/2409.05556

Another interesting framework published recently is Kosmos by Mitchener et al.⁴ Therein, existing discoveries from papers are linked and analysed using AI agents, in a structured way, leading to the automated build-up of a world model of the field of research in question. The world model includes discoveries, conclusions, results of automated analysis, and represents the thinking process eventually leading to the output (new discoveries). It connects all the dots.

Billion-Dollar Bets: The Race for Autonomous Research Agents

Recently, autonomous research agents have emerged as billion-dollar bets. To name a few:

Lila Sciences - initial investment of ~$550 million,
Periodic Labs - initial investment of ~$300 million,
Emerald Cloud Lab - initial investment of ~$100 million,
Kebotix - initial investment of ~$23 million,
Dunia Innovations - initial investment of ~$11 million,
LabGenius - initial investment of ~$70 million, and many more.

Some of these companies focus on the full process, from automated materials discovery to experimental synthesis, using automated robotics labs. The promise for most of them is very similar:

Developing a good world model for the domain of interest,
Synthesizing candidates in closed-loop robotic labs,
Generating proprietary experimental data at scale (patentable).

Outlook: Ontologies Meet Autonomy

It remains to be seen how autonomous research agents will perform in less constrained environments, where more general scientific discoveries are attempted beyond a narrow domain such as finding a new material within a single class of materials. Ontologies can help constrain very general observations.

Combining the American drive in pure AI with European data sources could do a lot for the world of autonomous research agents. Strong governance with rigorous semantic foundations might prove essential.

1. J. Mavračić, C. J. Court, T. Isazawa, S. R. Elliott, and J. M. Cole, ‘ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science’, J. Chem. Inf. Model., vol. 61, no. 9, pp. 4280–4289, Sept. 2021, doi: 10.1021/acs.jcim.1c00446.
2. H. Zhang, Y. Song, Z. Hou, S. Miret, and B. Liu, ‘HoneyComb: A Flexible LLM-Based Agent System for Materials Science’, in Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA: Association for Computational Linguistics, 2024, pp. 3369–3382. doi: 10.18653/v1/2024.findings-emnlp.192.
3. A. Ghafarollahi and M. J. Buehler, ‘SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning’, Sept. 09, 2024, arXiv:2409.05556. doi: 10.48550/arXiv.2409.05556.
4. L. Mitchener et al., ‘Kosmos: An AI Scientist for Autonomous Discovery’, Nov. 05, 2025, arXiv:2511.02824. doi: 10.48550/arXiv.2511.02824.
