Artificial intelligence has dazzled us with its predictive prowess. From composing text to predicting protein structures, today’s large language models (LLMs) and other foundation models have begun to resemble the astronomer Johannes Kepler: brilliant at describing patterns, but not necessarily explaining why they work. The question now facing researchers is: can these systems ever take the Newtonian leap — from surface-level predictions to deep, generalizable understanding of the world?
A new study by researchers at MIT’s Laboratory for Information and Decision Systems (LIDS) and Harvard University takes aim at this puzzle. Presented at the International Conference on Machine Learning in Vancouver, the work by Keyon Vafa (Harvard), Peter G. Chang (MIT), Ashesh Rambachan (MIT), and Sendhil Mullainathan (MIT) proposes a new way to measure whether AI systems are merely good guessers or whether they have developed something closer to a world model.
From Predictions to Principles
“Humans all the time have been able to make this transition from good predictions to world models,” says Vafa, lead author. Newton didn’t just match Kepler’s orbital data — he showed why planets move as they do, and his laws applied to countless new scenarios. By contrast, LLMs and similar AI systems tend to excel at narrow tasks but falter when asked to transfer that knowledge across domains.
Mullainathan, the study’s senior author, frames the challenge simply: “We know how to test whether an algorithm predicts well. But what we need is a way to test for whether it has understood well.” Even defining “understanding,” he admits, is not straightforward.
The Test of Inductive Bias
The researchers introduce a new metric of inductive bias: the degree to which a system's inferences, as it extrapolates from limited data, align with the true structure of the world. If a model consistently infers the right underlying dynamics from raw data, even in unfamiliar situations, it is showing signs of understanding.
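The paper defines its metric formally; purely to convey the flavor of such a test, the toy Python sketch below fits a naive count-based predictor on a few trajectories from a simple deterministic world and scores how often its extrapolations agree with the true rule on states it never saw during training. Every name and number here is a made-up placeholder, not the authors' procedure.

```python
import numpy as np

# Toy illustration only (not the paper's metric). The world has S discrete
# states and a true deterministic rule: "step one state to the right."
# We fit a naive count-based predictor on a few short trajectories, then
# score how often its predictions agree with the true rule on states it
# never saw during training. A model with the right inductive bias would
# score highly here; a pure pattern-matcher would not.

S = 10

def true_next(s):
    # The hidden rule of this world.
    return (s + 1) % S

def trajectory(start, length):
    states = [start]
    for _ in range(length):
        states.append(true_next(states[-1]))
    return states

# "Training data" covers only states 0 through 5.
train = [trajectory(s, 2) for s in range(4)]

# Count-based predictor (a stand-in for a learned sequence model).
counts = np.ones((S, S))                      # Laplace smoothing
for traj in train:
    for a, b in zip(traj[:-1], traj[1:]):
        counts[a, b] += 1
est_next = counts.argmax(axis=1)              # model's predicted next state

# Agreement with the true rule on states never seen in training.
unseen = range(6, S)
agreement = np.mean([est_next[s] == true_next(s) for s in unseen])
print(f"agreement with the true rule on unseen states: {agreement:.2f}")
```

A predictor that merely memorizes observed transitions scores near zero on the unseen states, while one that has internalized the "step right" rule would score near one; that gap is the kind of signal an inductive-bias measurement is after.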
To probe this, the team designed a series of tests. One involved simple “lattice models” — abstract worlds where movement happens along a line or grid, like a frog hopping across lily pads. With limited complexity, predictive AI could reconstruct the hidden structure well. But as dimensions and states increased, accuracy collapsed.
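For a rough sense of what such a lattice world looks like (the paper's setups are more varied than this), here is a minimal, hypothetical generator for a one-dimensional "lily pad" world, plus a check for whether a predicted trajectory respects the lattice structure, meaning it only ever moves to an adjacent pad.

```python
import numpy as np

# Hypothetical data generator for a 1-D "lily pad" lattice: a frog occupies
# one of n_pads positions and hops left, right, or stays put at each step.
# A sequence model trained on hop sequences can then be checked for whether
# the trajectories it predicts respect the lattice (only moves of -1, 0, +1).

rng = np.random.default_rng(1)

def frog_walk(n_pads=5, steps=20):
    pos = int(rng.integers(n_pads))
    path = [pos]
    for _ in range(steps):
        move = int(rng.choice([-1, 0, 1]))
        pos = int(np.clip(pos + move, 0, n_pads - 1))
        path.append(pos)
    return path

def respects_lattice(predicted_path):
    # True iff every consecutive pair of positions is adjacent on the line.
    return all(abs(b - a) <= 1 for a, b in zip(predicted_path[:-1], predicted_path[1:]))

print(frog_walk())                        # e.g. [2, 3, 3, 2, 1, ...]
print(respects_lattice([0, 1, 2, 2, 1]))  # True: all hops are lattice moves
print(respects_lattice([0, 3, 1]))        # False: the frog teleports
```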
Another test used the board game Othello. Models trained on move sequences could predict legal moves but struggled to reconstruct the overall arrangement of the board, including pieces that were currently blocked from play. In other words, prediction was strong but comprehension was weak.
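One common way to check whether a move-predicting sequence model internally represents the board, used in earlier Othello-probing work though not necessarily in this paper's exact procedure, is to train a small linear probe on the model's hidden activations to recover each square's contents. The sketch below shows that pattern with random placeholder data standing in for real activations and labels; low held-out probe accuracy would suggest the model predicts moves without tracking the full board.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic probing sketch with placeholder data (not the authors' method).
# hidden_states: one activation vector per board position in a game record.
# board_labels:  the true contents of each of the 64 squares at that point
#                (0 = empty, 1 = black, 2 = white).
n_positions, d_model, n_squares = 2000, 64, 64
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(n_positions, d_model))            # stand-in activations
board_labels = rng.integers(0, 3, size=(n_positions, n_squares))   # stand-in square labels

split = 1500  # train the probes on the first games, evaluate on the rest
accuracies = []
for sq in range(n_squares):
    probe = LogisticRegression(max_iter=200)
    probe.fit(hidden_states[:split], board_labels[:split, sq])
    accuracies.append(probe.score(hidden_states[split:], board_labels[split:, sq]))

print(f"mean per-square probe accuracy: {np.mean(accuracies):.2f}")
```

With genuinely random placeholders the probes land near chance (about 0.33); the interesting question is how far above chance they climb when fed a real model's activations.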
Across five categories of real-world predictive systems, the same trend emerged: more complexity meant less evidence of true world modeling.
Why It Matters
This matters because AI is already being applied in high-stakes domains — from drug discovery to physics simulations. If a system can’t generalize, then its dazzling predictions may not withstand the messy complexity of reality. “For more realistic problems, even for something like basic mechanics, we found that there seems to be a long way to go,” Vafa notes.
Yet the researchers see promise in their new metric. By quantifying inductive bias, they provide a testbed for evaluating and improving foundation models. “As an engineering field, once we have a metric for something, people are really, really good at optimizing that metric,” says Chang.
The Road Ahead
The work tempers the hype surrounding foundation models. Building ever larger models trained on oceans of data won’t automatically yield Newton-like insights. Instead, careful measurement, evaluation, and new training paradigms may be needed to nudge AI toward genuine world modeling.
The Kepler-to-Newton analogy captures the stakes: prediction without understanding can be useful, but understanding unlocks universality. Whether AI will ever achieve that leap remains an open question. For now, the verdict is clear: our most advanced models are still brilliant Keplers — not yet Newtons.