Technology

Understanding AI's Essence Through Solomonoff Induction: The Science of Generalization Through Compression

Unlocking the mystery of unsupervised learning through 'compression': A theoretical foundation reevaluated in the LLM era

2025-04-13
24 min
AI Technology
Machine Learning Theory
Solomonoff Induction
LLM
Unsupervised Learning
Data Compression
Ryosuke Yoshizaki

CEO, Wadan Inc. / Founder of KIKAGAKU Inc.


Beginning the Journey to Understanding AI Through Compression

Recently, while browsing X (formerly Twitter), I've noticed the term "Solomonoff induction" appearing frequently in certain AI circles. What particularly caught my attention was the discussion linking the success of unsupervised learning to "compression". It's intriguing how researchers like Dan Selsam refer to Solomonoff induction as an ideal form of intelligence, describing it as "the process of finding the smallest representation".

Additionally, discussions among OpenAI associates included the following perspective:

When asked "Why does unsupervised learning work?", the response was, "It's compression. The ideal intelligence is called Solomonoff induction. This essentially involves imagining and considering all possible worlds when uncertain about what world you exist in. In this approach, simpler worlds are considered more probable than complex ones. This is completely Bayesian, updating your perspective with each experience while holding all possibilities in mind. This way of thinking can be approximated by finding the shortest program capable of computing everything you've observed. One way to view what we're doing in pre-training, or one interpretation of it, is precisely this compression."

These discussions strongly stimulated my intellectual curiosity. As I detailed in Modeling Uncertainty Through Bayesian Updates, I have an affinity for Bayesian theory. I'm also interested in the "exploration vs. exploitation" trade-off in Bayesian optimization, and I recognized a connection to these ideas.

However, when first encountering concepts like Solomonoff induction and Kolmogorov complexity, I faced the challenge of fully understanding their theoretical significance and practical value. How do these concepts provide insights for AI development, and how can they be applied to modern machine learning? I felt the need to systematically organize these ideas, including their relationships with existing conceptual frameworks like Bayesian theory and Occam's razor.

In this article, I aim to understand what "Solomonoff induction" is from an AI engineer's perspective and explore why it's receiving attention now. I want to delve into the mystery of unsupervised learning through the lens of "compression" and "generalization". Being still in the learning process myself, my explanations may be incomplete in some aspects. Nevertheless, I hope this article serves to deepen the reader's understanding.

What is Solomonoff Induction: An Intuitive Understanding

When first hearing the term "Solomonoff induction," it might seem intimidating. However, its core is surprisingly simple: "Finding the shortest program leads to the best prediction."

The Shortest Program Perspective

For an intuitive understanding of Solomonoff induction, let's consider a concrete example.

If asked to complete the following sequence, what would come next?

2, 4, 6, 8, ?

Most people would answer "10". Why? Because the "rule of increasing by 2" is the simplest explanation. Of course, infinitely many answers are possible. For instance, a rule generating the sequence "2, 4, 6, 8, 123, ..." also exists. However, such a rule is complex and requires more information (a longer program) to explain.

Solomonoff induction is the mathematical formalization of this human intuition to "choose the simplest explanation." It's a theory that finding the shortest program (= the simplest explanation) that can generate the observed data is the best method for predicting unknown data.

In the context of machine learning, this translates to the idea that "the model that can most compactly compress data will have the best generalization performance."
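
To make this concrete, here is a minimal sketch in Python. Everything about it is a simplifying assumption: the hypothesis space is a hand-picked set of arithmetic expressions, and plain character count stands in for program length, whereas true Solomonoff induction ranges over all programs on a universal Turing machine and is non-computable:

```python
# A toy version of "find the shortest program": the hypothesis space is a
# small set of Python expressions in n, and character count stands in for
# program length. (Real Solomonoff induction searches all programs on a
# universal Turing machine and is non-computable.)

observed = [2, 4, 6, 8]

# Hypothetical program space: short arithmetic expressions in n = 0, 1, 2, ...
hypotheses = [f"{a} + {b}*n" for a in range(10) for b in range(10)]
# A "memorizing" program that hard-codes the data and then outputs 123.
hypotheses.append("[2, 4, 6, 8, 123][n]")

def generates(expr: str, data: list) -> bool:
    """Does the expression reproduce every observed value?"""
    return all(eval(expr, {}, {"n": n}) == x for n, x in enumerate(data))

consistent = sorted((h for h in hypotheses if generates(h, observed)), key=len)
best = consistent[0]                                      # shortest consistent "program"
print(best, "->", eval(best, {}, {"n": len(observed)}))   # 2 + 2*n -> 10
```

Both "2 + 2*n" and the memorizing lookup table reproduce the data, but the shorter expression wins, so the prediction is 10 rather than 123.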

Mathematical Formalization of Occam's Razor

"Solomonoff induction" can be viewed as the mathematical and computational formalization of "Occam's razor" (the principle that among multiple hypotheses, one should select the simplest).

While Occam's razor is intuitively understandable, the quantification of "simplicity" remained ambiguous. Solomonoff induction provided mathematical rigor to this principle by defining "simplicity" as "the length of the program that executes that explanation on a computer."

Why It's Being Reevaluated in the LLM Era

So why is this 60-year-old theory receiving attention now? The reason is deeply connected to the success of Large Language Models (LLMs).

The pre-training of LLMs is conducted through the simple task of "predicting the next word" from large volumes of text data. While this might appear to be mere statistical pattern learning, this process can actually be viewed as data compression.

To perform efficient compression, one needs to capture the structures and patterns behind the data. These include "programming language grammar," "physical laws," and "human social norms". The reason LLMs appear to have acquired such advanced knowledge may be a byproduct of efficiently compressing text.

The intuition that "compression" is "generalization" has emerged as the underlying principle behind LLMs' remarkable capabilities. And this idea is surprisingly consistent with the theory that Solomonoff induction proposed decades ago.

Theoretical Foundations of Solomonoff Induction

Let's now look more deeply at the theoretical background of Solomonoff induction. Rather than delving into complex mathematics, I'll focus on the essentials that AI engineers should understand.


Algorithmic Probability and Universal Prior Probability

At the center of Solomonoff induction is the concept of "algorithmic probability". This associates the probability of generating a string or data with the length of the program that can produce it.

Specifically, it is defined as the probability that random bits input to a universal Turing machine (UTM, essentially a computer model) will by chance form a program that outputs the desired data.

A key point of this approach is that shorter programs have a higher probability of being generated by chance. For example, the probability of a 10-bit program being generated by chance is $2^{-10}$, but the probability for a 100-bit program is $2^{-100}$, which is vastly smaller.

This characteristic of "prioritizing shorter programs" creates a bias in Solomonoff induction toward "favoring simpler hypotheses". Remarkably, it has been mathematically proven that this bias toward simplicity maximizes the accuracy of predictions for unknown data [1].
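
A quick numerical illustration of how aggressive this prior is; the program lengths below are arbitrary, made-up values:

```python
# How strongly the 2^{-l(p)} prior favours short programs. The program
# lengths here are hypothetical, illustrative values.
lengths = [10, 20, 50, 100]                    # hypothetical lengths in bits
weights = {l: 2.0 ** -l for l in lengths}
total = sum(weights.values())
for l, w in weights.items():
    print(f"{l:>3}-bit program: prior share = {w / total:.3e}")
# Each extra bit halves a program's weight, so the 10-bit program
# holds essentially all of the probability mass here.
```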

Kolmogorov Complexity and the Essence of Information

A concept closely related to Solomonoff induction is "Kolmogorov complexity", defined as the length of the shortest program that can generate given data. Formally, it is expressed as:

$$K_U(x) = \min_{p: U(p)=x} l(p)$$

Here, $U$ is the reference universal Turing machine, $p$ is a program, $l(p)$ is the length of program $p$, and $U(p)=x$ means that $U$ outputs $x$ when executing $p$.

This value, represented as $K(x)$, serves as a measure of the essential information content or complexity of the data $x$. Consider these examples:

  • The Kolmogorov complexity of aaaaaaaaaaaaaaaaaaaa (20 a's) is low (it can be generated by a program that says "repeat a 20 times")
  • The Kolmogorov complexity of a disordered string q2k9z7r5p3m1l8n6o4 is high (you'd need to essentially hardcode the string itself)
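
Since the shortest program itself cannot be computed, a common practical proxy is an off-the-shelf compressor, which gives a crude upper bound on $K(x)$. A small sketch using Python's zlib (exact byte counts depend on the compressor and its settings):

```python
import zlib

# Kolmogorov complexity is non-computable, but an ordinary compressor
# gives a crude upper-bound proxy: structured data compresses well,
# patternless data does not.
samples = {
    "structured": b"a" * 20,                 # "aaaaaaaaaaaaaaaaaaaa"
    "disordered": b"q2k9z7r5p3m1l8n6o4",
}
for label, s in samples.items():
    out = zlib.compress(s, 9)
    print(f"{label}: {len(s)} bytes -> {len(out)} bytes")
# For strings this short the compressor's fixed header dominates, but the
# gap between the two cases already points in the right direction.
```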

The invariance theorem demonstrates an important property. For different universal Turing machines $U$ and $V$, the following relationship holds:

$$|K_U(x) - K_V(x)| \leq C_{UV}$$

Here, $C_{UV}$ is a constant that depends on $U$ and $V$ but not on $x$. This means that changing the underlying computational model shifts complexity values by at most a fixed constant, so Kolmogorov complexity is essentially machine-independent up to an additive term.

Interestingly, Kolmogorov complexity is provably non-computable. Even so, it remains a powerful concept through theoretical frameworks and practical approximations.

Connection to Bayesian Inference

Solomonoff induction is essentially positioned within the Bayesian inference framework. Let's recall Bayes' theorem:

$$P(H|D) = \frac{P(D|H) \times P(H)}{P(D)}$$

Where each term means:

  • $P(H|D)$: The probability of hypothesis $H$ given data $D$ (posterior probability)
  • $P(D|H)$: The probability of data $D$ being generated under hypothesis $H$ (likelihood)
  • $P(H)$: The prior probability of hypothesis $H$
  • $P(D)$: The marginal probability of data $D$ (normalization constant)

In Solomonoff induction, the prior distribution $P(x)$ is formally defined as:

$$P(x) = \sum_{p: U(p)=x} 2^{-l(p)}$$

Where $U(p)$ is the output when program $p$ is executed, and $l(p)$ is the length of program $p$. The sum runs over all programs that output $x$.

Given a data sequence $x_{1:n} = (x_1, x_2, \ldots, x_n)$, the prediction probability for the next data point $x_{n+1}$ is calculated as:

$$P(x_{n+1}|x_{1:n}) = \frac{P(x_{1:n}, x_{n+1})}{P(x_{1:n})}$$

In this framework, hypothesis $H$ is "a program that generates the data", and its prior probability $P(H)$ is assigned based on the length of the program ($P(H) \propto 2^{-l(H)}$).

The relationship with Kolmogorov complexity $K(x)$ is also important; $P(x)$ and $K(x)$ are related by:

$$-\log_2 P(x) = K(x) + O(1)$$

This means that data with higher probability has lower Kolmogorov complexity (= can be generated by shorter programs).

As a result, among all programs that match the observed data, the shortest one naturally has a higher posterior probability. Using this probability distribution to predict future data is the essence of Solomonoff induction.
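
The following toy sketch mimics this posterior weighting under heavy simplifications: "programs" are just bit patterns repeated forever, and a pattern of length $L$ stands in for an $L$-bit program with prior weight $2^{-L}$, so this is a caricature of the real construction:

```python
from itertools import product

# A heavily simplified, Solomonoff-flavoured predictor. "Programs" are bit
# patterns repeated forever; a pattern of length L stands in for an L-bit
# program with prior weight 2^{-L}. We keep hypotheses consistent with the
# data and predict the next bit by prior-weighted vote. (Real Solomonoff
# induction mixes over all programs on a universal Turing machine.)

def predict_next(data: str, max_len: int = 6) -> dict:
    scores = {"0": 0.0, "1": 0.0}
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            pattern = "".join(bits)
            stream = pattern * (len(data) // L + 2)   # unroll the "program"
            if stream.startswith(data):               # consistent with the data?
                scores[stream[len(data)]] += 2.0 ** -L
    total = scores["0"] + scores["1"]                 # assumes some hypothesis fits
    return {bit: s / total for bit, s in scores.items()}

# The shortest consistent pattern "0011" dominates the mixture, so "0" gets
# roughly twice the weight of "1": the Occam bias at work.
print(predict_next("0011"))   # ~{'0': 0.67, '1': 0.33}
```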

Significance for AI Engineers

Why is this important for AI engineers? It can be summarized in the following points:

  1. Fundamental Principle of Generalization: The essential goal of machine learning is generalization, and Solomonoff induction shows its theoretical limit
  2. Theoretical Basis for "Compression = Learning": Understanding that data compression and learning are essentially the same process provides guidance for algorithm design
  3. Understanding the Role of Bias: A theoretical explanation of why bias toward simplicity leads to good generalization
  4. Recognition of Learning Limitations: Recognizing that complete Solomonoff induction is non-computable, and actual AI systems are merely approximations of it

Compression and Generalization: Solving the Mystery of Unsupervised Learning

The question "Why does unsupervised learning work?" is a fundamental query related to the core of AI. The key to solving this mystery lies in the concept of "compression".

Machine Learning as "Compression"

Machine learning, especially unsupervised learning, can be viewed from the perspective of "data compression". To perform efficient data compression, one needs to capture the structures and patterns behind the data. Consider these examples:

  • Image compression captures visual patterns such as correlations between pixels, edges, and textures
  • Text compression utilizes word frequencies, grammatical structures, and semantic relationships
  • Audio compression leverages frequency characteristics and temporal patterns

Extracting and utilizing these patterns is essentially what "learning" is.

The relationship between compression and learning is bidirectional:

  • Good learning → Enables efficient compression
  • Efficient compression → Implies good pattern recognition (= learning)

This perspective originated with Claude Shannon, the founder of information theory, and continues to influence modern deep learning research [2].

Why Unsupervised Learning Works

Let's consider the mystery of unsupervised learning, particularly LLM pre-training, from the perspective of "compression".

In LLM pre-training, the task of predicting the next token (word or part of a word) is used. For example, predicting "sunny" for the input "Today's weather is ____".

This next-token prediction essentially corresponds to data compression. This is because being able to efficiently predict tokens means capturing the redundancies (patterns) contained in the text [3].

Ilya Sutskever (co-founder and former Chief Scientist of OpenAI) has expressed the view that "compression equals generalization" [4]. According to this idea, a model that can efficiently compress data has captured the essential structure of that data, and can therefore make good predictions (generalize well) on unseen data.


Understanding LLM Pre-training from a "Compression" Perspective

Viewing LLM pre-training as "compression" provides several interesting insights:

  1. Automatic Acquisition of Language Structure: To achieve efficient prediction (= compression), the model naturally learns grammar rules and common-sense knowledge
  2. Necessity of Large-scale Data: Better compression requires observing more patterns
  3. Explanation of Scaling Laws: Increasing model size allows capturing more complex patterns, improving compression (= prediction) performance
  4. Understanding Emergent Abilities: With sufficient compression capability, higher-order abilities (like reasoning) that weren't explicitly trained can be acquired

OpenAI researchers' description of their pre-training as "an attempt to find the shortest program that explains all the data humans have produced" shows they're viewing LLMs from this perspective of Solomonoff induction and compression.

Compression and Prediction: Information-Theoretical Connection

Viewing the relationship between compression and prediction from an information theory perspective provides even deeper understanding. Formally, a good predictor is a good compressor, and vice versa.

According to Shannon's information theory, the expected minimum number of bits required to encode a message following a probability distribution $P$ equals the entropy of that distribution, $H(P)$:

$$H(P) = -\sum_x P(x) \log_2 P(x)$$

When using a probability distribution $Q$ to encode messages from distribution $P$, the expected number of bits required becomes the cross-entropy $H(P,Q)$:

$$H(P,Q) = -\sum_x P(x) \log_2 Q(x)$$

Here, $H(P,Q) \geq H(P)$, with equality holding only when $P=Q$. In other words, prediction distributions closer to the true distribution enable more efficient compression.
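
A quick numerical check of this inequality, using made-up distributions $P$ and $Q$:

```python
import math

# Numeric check of H(P,Q) >= H(P): coding against the wrong model Q
# always costs extra bits (the gap is the KL divergence).
P = {"a": 0.5, "b": 0.25, "c": 0.25}   # true next-symbol distribution
Q = {"a": 0.8, "b": 0.1, "c": 0.1}     # a miscalibrated predictor

H_P = -sum(p * math.log2(p) for p in P.values())
H_PQ = -sum(P[s] * math.log2(Q[s]) for s in P)

print(f"H(P)   = {H_P:.3f} bits/symbol")    # 1.500
print(f"H(P,Q) = {H_PQ:.3f} bits/symbol")   # ~1.822, always >= H(P)
```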

For example, "arithmetic coding" is a compression technique that concretely implements this principle. It performs encoding based on probability predictions for the next symbol, with higher prediction accuracy improving compression ratio. For a sequential data x1,x2,...,xnx_1, x_2, ..., x_n, the code length using arithmetic coding with probability model PP approaches:

$$\text{Code length} \approx -\log_2 P(x_1, x_2, \ldots, x_n) = -\sum_{i=1}^n \log_2 P(x_i | x_1, \ldots, x_{i-1})$$

Conversely, the task "predict the next token" can be considered equivalent to the task "compress the text as much as possible" [5]. This is because a model's loss function (cross-entropy) is expressed as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i} \log P_\theta(x_{i,t} | x_{i,<t})$$

which directly corresponds to the average code length when encoding the data using model $P_\theta$.
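
To see this identity in running code, here is a sketch that computes the ideal code length of a sequence under a Laplace-smoothed character bigram model; the model, the training string, and the smoothing scheme are all illustrative stand-ins for $P_\theta$:

```python
import math
from collections import Counter, defaultdict

# The ideal arithmetic-coding length of a sequence under model P is
# -sum_i log2 P(x_i | context): exactly the (total) cross-entropy loss.

train = "abababababababab"
test = "abababab"
alphabet = sorted(set(train + test))

counts = defaultdict(Counter)
for prev, cur in zip(train, train[1:]):
    counts[prev][cur] += 1

def prob(cur: str, prev: str) -> float:
    # Laplace smoothing keeps unseen transitions at nonzero probability.
    return (counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(alphabet))

code_length = -sum(
    math.log2(prob(cur, prev)) for prev, cur in zip(test, test[1:])
)
print(f"{code_length:.2f} bits for {len(test) - 1} predicted chars")
# A better predictor (lower cross-entropy) would compress to fewer bits.
```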

This is no coincidence but is rooted in the fundamental principles of information theory. The lower the uncertainty (entropy), the less information is required, enabling more efficient compression. Good prediction reduces uncertainty, which directly leads to efficient compression.

Evolution of Solomonoff Induction: From Past to Future

To understand the historical development and modern significance of Solomonoff induction, let's trace its temporal evolution.

Evolution of Solomonoff Induction

1956

Birth of AI

The field of AI is formally founded at the Dartmouth Conference. Ray Solomonoff also participates in this conference.

1960-1964

Proposal of Solomonoff Induction

Ray Solomonoff publishes a series of papers proposing a theory of universal inductive inference.

circa 1965

Development of Kolmogorov Complexity

Andrey Kolmogorov independently develops a measure of complexity (Kolmogorov complexity).

1990s

Popularization of Minimum Description Length Principle

The Minimum Description Length (MDL) principle, related to Solomonoff induction, gains attention in the machine learning field.

circa 2000

Emergence of the AIXI Model

Marcus Hutter presents the AIXI model that integrates Solomonoff induction with reinforcement learning. It attracts attention as a theoretical framework for universal AI.

Late 2010s

Rise of Large Language Models

LLMs rapidly develop with the success of Transformer architecture and large-scale unsupervised pre-training.

2022-2024

Connection Between LLMs and Solomonoff Induction

Discussion linking LLM success to 'compression' and citing Solomonoff induction as a theoretical basis becomes active.

March 2025

Proposal of the Kolmogorov Test

The 'Kolmogorov test', using compression through code generation to evaluate LLM capabilities, is proposed.

History and Development of the Theory

Solomonoff induction was proposed in the early 1960s by Ray Solomonoff (one of the founders of the AI field). He attempted to construct "a completely general and formal theory of inductive reasoning not limited to specific domains" [6].

Interestingly, Solomonoff induction was not a mainstream idea in the early AI community. At that time, AI was primarily centered on symbolic processing approaches, and probabilistic methods didn't receive much attention [7].

Around 2000, Marcus Hutter extended Solomonoff induction and combined it with reinforcement learning to propose the "AIXI model". This was a theoretical model of "universal AI" capable of optimal decision-making in any environment (though non-computable) [8].

Connection with Bayesian Optimization and My Research

As someone who chose Bayesian optimization as the theme of my master's thesis, I have a strong interest in this area. And there are interesting connections between Solomonoff induction and Bayesian optimization.

At the core of Bayesian optimization is "balancing exploration and exploitation": efficiently finding optimal solutions within a limited number of trials. In its theoretical limit, this resonates with Solomonoff induction's process of exploring the space of possible hypotheses and identifying the best explanation.

The three points they share are:

  1. Modeling Prior Knowledge: Representing uncertainty as probability distributions
  2. Evidence-based Updates: Updating distributions based on new data
  3. Bias Toward Simplicity: Prioritizing simpler explanations

These principles can be applied not only to machine learning but also to human decision-making and learning processes. They bridge both aspects of my research (Bayesian optimization and Solomonoff induction).

Connection Points with Modern AI Research

In modern AI research, particularly LLM research, Solomonoff induction has several important connection points:

  1. Theoretical Explanation of Scaling Laws: Performance improvements with increased model size can be explained as the ability to capture more complex patterns (= better compression)
  2. Foundation for Self-supervised Learning: Providing a theoretical explanation of why self-supervised learning like next-token prediction produces deep understanding
  3. Analysis of Emergent Abilities: Explaining the principle by which higher-order abilities not explicitly learned emerge with sufficient compression capacity
  4. New Perspectives on Model Evaluation: Development of new metrics based on the principles of Solomonoff induction, such as "compression through code generation"

A hypothesis currently drawing particular attention is that "the Transformer architecture might be a good approximation of Solomonoff induction" [9]. Its ability to capture long-range dependencies and predict based on complex contexts might be implementing an implicit mixture of many simple "programs".

Practical Applications and Limitations

Let's consider how to apply Solomonoff induction theory to actual AI development and applications, as well as its limitations.


Overcoming the Wall of Non-computability

The greatest challenge of Solomonoff induction is that it is non-computable. This limitation stems mainly from three points:

  1. Non-computability of Kolmogorov Complexity: No algorithm exists that can calculate the true complexity of given data
  2. Infinite Program Space: Need to consider all possible programs
  3. Halting Problem: Cannot generally determine whether a program will halt

These theoretical barriers mean that implementing complete Solomonoff induction is fundamentally impossible. However, this doesn't prevent the development of practical approximation methods.

Actual approximation approaches include:

  • Computational Resource Limitations: Setting upper limits on execution time or length of programs considered (such as approximations of the AIXI model)
  • Hypothesis Space Limitations: Restricting to specific model classes (like neural networks)
  • Practical Compression Algorithms: Using practical compression algorithms instead of theoretical Kolmogorov complexity (see the sketch after this list)
  • Meta-learning Approaches: Training models to approximate the behavior of Solomonoff induction by experiencing diverse prediction tasks
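
The third approach above can be made concrete with the Normalized Compression Distance (NCD), a classic construction that substitutes a real compressor for Kolmogorov complexity. A sketch using zlib; the example strings are arbitrary, and serious applications typically use stronger compressors:

```python
import zlib

# Normalized Compression Distance (NCD): replace Kolmogorov complexity
# with a real compressor to estimate shared structure between objects.
# Values near 0 suggest near-identical information, near 1 unrelated data.

def C(x: bytes) -> int:
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog" * 4
s2 = b"the quick brown fox leaps over the sleepy cat" * 4
noise = bytes(range(256)) * 2              # unrelated byte pattern

print(f"similar texts:  {ncd(s1, s2):.2f}")    # closer to 0
print(f"unrelated data: {ncd(s1, noise):.2f}") # closer to 1
```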

Influence on AI System Design

The perspective of Solomonoff induction can influence actual AI system design in the following ways:

  1. Emphasis on Unsupervised Pre-training: Designs based on compression learning from large volumes of data
  2. Explicit Introduction of Bias Toward Simplicity: Use of regularization or structured architectures
  3. Understanding the Value of Diverse Experience: Importance of learning broad knowledge rather than specific domains
  4. New Metrics for Model Evaluation: Using compression efficiency to evaluate model capabilities

For example, in LLM design, one could consider pre-training methods that explicitly tackle compression tasks rather than mere next-token prediction. Research is also progressing on evaluating how efficiently a model's internal representations compress data [10].

Four points of particular note in the latest research trends from 2024 to early 2025 are:

  1. Kolmogorov Test (KT): A benchmark using "compression through code generation" to evaluate LLM capabilities. It evaluates reasoning and compression abilities by having LLMs generate the shortest program that outputs a given data sequence [11].
  2. Explicit Approximation of Solomonoff Induction: Research attempting to explicitly reproduce the characteristics of Solomonoff induction by training neural networks on universal data [12].
  3. Relationship Between Transformers and Solomonoff Induction: Research theoretically analyzing why the Transformer architecture might be a better approximation of Solomonoff induction than other architectures [13].
  4. Ethical Implications of Prior Probabilities: Discussion on how the universal prior in Solomonoff induction affects AI decision-making and values [14].

My Perspective as an AI Engineer

As an AI engineer rather than a theoretical researcher, my practical insights from Solomonoff induction are as follows:

  1. Utility of the "Compression" Perspective: The perspective of "how efficiently can the model compress data" provides a very practical guideline in designing and evaluating machine learning models

  2. Importance of Bias Toward Simplicity: Provides theoretical grounds for prioritizing simplicity in model design to prevent overfitting and achieve good generalization

  3. Recognition of the Gap Between Theory and Practice: Even if theoretically optimal algorithms are non-computable, practical approaches inspired by their insights are valuable

  4. Balance Between Exploration and Exploitation: The principle of exploring the hypothesis space as widely as possible while concentrating resources on promising hypotheses is practically important even with finite computational resources

Summary and Outlook

We've seen how the seemingly complex concept of "Solomonoff induction" actually provides an important perspective for understanding modern AI, especially unsupervised learning.

The Essence of Unsupervised Learning from a Compression Perspective

One answer to the question "Why does unsupervised learning work?" lies in the concept of "compression". The view that efficient compression necessarily leads to good generalization provides a powerful framework that consistently explains various phenomena in deep learning.

Solomonoff induction gives this idea a theoretical foundation, mathematically formalizing the principle that "finding the shortest program leads to the best prediction". This can be seen as formalizing the intuition that "compression is generalization".

My Perspective as an AI Engineer and Future Directions

As an AI engineer, these are the four points I'd like to focus on in the future:

  1. Research on Architectures that Enhance Compression Efficiency: The perspective of how new architectures beyond Transformers achieve more efficient data compression
  2. Integration of Unsupervised Learning and Reinforcement Learning: Development of frameworks that integrate prediction and decision-making, like the AIXI extension of Solomonoff induction
  3. Models that Perform Efficient Compression Even at Small Scale: Model designs that can perform efficient compression learning even without large computational resources
  4. Standardization of Compression Capability Evaluation Metrics: Development of benchmarks like the Kolmogorov test that directly evaluate models' compression capabilities

The Beginning of an Intellectual Journey

In this article, I've explored the essence and modern significance of "Solomonoff induction", a term I encountered on X. While its theoretical background is profound, the concept rests on intuitive cores like "prioritize shorter programs" and "compression generates generalization", and these can serve as practical guidelines for AI engineers.

The intellectual curiosity journey that begins with encountering unfamiliar terms on X and leads to exploration always brings new discoveries. I hope this journey of capturing the essence of AI from the perspective of "compression" has provided you with some compression (= a new framework of understanding) in your thinking as well.

References

Footnotes

  1. Hutter, M. (2000). "A Theory of Universal Artificial Intelligence based on Algorithmic Complexity." arXiv:cs/0004001.

  2. Shannon, C.E. (1948). "A Mathematical Theory of Communication." The Bell System Technical Journal, 27(3), 379-423.

  3. Brown, T.B., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33.

  4. Sutskever, I. (2022). "How do we get to AGI from here? The role of RL, self-supervised learning, and meta-learning." Stanford HAI Seminar.

  5. Deletang, G., et al. (2023). "Language Modeling Is Compression." arXiv:2309.10668.

  6. Solomonoff, R.J. (1964). "A formal theory of inductive inference. Part I." Information and Control, 7(1), 1-22.

  7. Sterkenburg, T.F. (2016). "Solomonoff Prediction and Occam's Razor." Philosophy of Science, 83(4), 459-479.

  8. Hutter, M. (2000). "A Theory of Universal Artificial Intelligence based on Algorithmic Complexity." arXiv:cs/0004001.

  9. Akrout, M., et al. (2024). "Transformers As Approximations of Solomonoff Induction." arXiv:2408.12065.

  10. Deletang, G., et al. (2023). "Neural Networks as Compression: Insights from Deep Learning's Local Function Approximation." arXiv:2306.14747.

  11. Colby, L., et al. (2025). "The KoLMogorov Test: Compression by Code Generation." arXiv:2503.13992.

  12. Scherlis, A., et al. (2024). "Learning Universal Predictors." arXiv:2401.14953.

  13. Akrout, M., et al. (2024). "Transformers As Approximations of Solomonoff Induction." arXiv:2408.12065.

  14. Garrabrant, S., et al. (2023). "Concerns about the Solomonoff prior." Alignment Forum.

Ryosuke Yoshizaki

CEO, Wadan Inc. / Founder of KIKAGAKU Inc.

I am working on structural transformation of organizational communication with the mission of 'fostering knowledge circulation and driving autonomous value creation.' By utilizing AI technology and social network analysis, I aim to create organizations where creative value is sustainably generated through liberating tacit knowledge and fostering deep dialogue.
