Technology

Understanding AI's Essence Through Solomonoff Induction: The Science of Generalization Through Compression

Unlocking the mystery of unsupervised learning through 'compression': A theoretical foundation reevaluated in the LLM era

2025-04-13
24 min
AI Technology
Machine Learning Theory
Solomonoff Induction
LLM
Unsupervised Learning
Data Compression
Ryosuke Yoshizaki

CEO, Wadan Inc. / Founder of KIKAGAKU Inc.


Beginning the Journey to Understanding AI Through Compression

Recently, while browsing X (formerly Twitter), I've noticed the term "Solomonoff induction" appearing frequently in certain AI circles. What particularly caught my attention was the discussion linking the success of unsupervised learning to "compression". It's intriguing how researchers like Dan Selsam refer to Solomonoff induction as an ideal form of intelligence, describing it as "the process of finding the smallest representation".

Additionally, discussions among OpenAI associates included the following perspective:

When asked "Why does unsupervised learning work?", the response was, "It's compression. The ideal intelligence is called Solomonoff induction. This essentially involves imagining and considering all possible worlds when uncertain about what world you exist in. In this approach, simpler worlds are considered more probable than complex ones. This is completely Bayesian, updating your perspective with each experience while holding all possibilities in mind. This way of thinking can be approximated by finding the shortest program capable of computing everything you've observed. One way to view what we're doing in pre-training, or one interpretation of it, is precisely this compression."

These discussions strongly stimulated my intellectual curiosity. As I detailed in Modeling Uncertainty Through Bayesian Updates, I have an affinity for Bayesian theory. I'm also interested in the "exploration vs. exploitation" trade-off in Bayesian optimization, and I recognized a connection to these ideas.

However, when first encountering concepts like Solomonoff induction and Kolmogorov complexity, I faced the challenge of fully understanding their theoretical significance and practical value. How do these concepts provide insights for AI development, and how can they be applied to modern machine learning? I felt the need to systematically organize these ideas, including their relationships with existing conceptual frameworks like Bayesian theory and Occam's razor.

In this article, I aim to understand what "Solomonoff induction" is from an AI engineer's perspective and explore why it's receiving attention now. I want to delve into the mystery of unsupervised learning through the lens of "compression" and "generalization". Being still in the learning process myself, my explanations may be incomplete in some aspects. Nevertheless, I hope this article serves to deepen the reader's understanding.

What is Solomonoff Induction: An Intuitive Understanding

When first hearing the term "Solomonoff induction," it might seem intimidating. However, its core is surprisingly simple: "Finding the shortest program leads to the best prediction."

The Shortest Program Perspective

For an intuitive understanding of Solomonoff induction, let's consider a concrete example.

If asked to complete the following sequence, what would come next?

2, 4, 6, 8, ?

Most people would answer "10". Why? Because the "rule of increasing by 2" is the simplest explanation. Of course, infinitely many answers are possible. For instance, a rule generating the sequence "2, 4, 6, 8, 123, ..." also exists. However, such a rule is complex and requires more information (a longer program) to explain.

Solomonoff induction is the mathematical formalization of this human intuition to "choose the simplest explanation." It's a theory that finding the shortest program (= the simplest explanation) that can generate the observed data is the best method for predicting unknown data.

In the context of machine learning, this translates to the idea that "the model that can most compactly compress data will have the best generalization performance."
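
To make this concrete, here is a minimal sketch in Python. Everything about it is a simplifying assumption: the hypothesis space is a hand-picked set of arithmetic expressions, and plain character count stands in for program length, whereas true Solomonoff induction ranges over all programs on a universal Turing machine and is non-computable:

```python
# A toy version of "find the shortest program": the hypothesis space is a
# small set of Python expressions in n, and character count stands in for
# program length. (Real Solomonoff induction searches all programs on a
# universal Turing machine and is non-computable.)

observed = [2, 4, 6, 8]

# Hypothetical program space: short arithmetic expressions in n = 0, 1, 2, ...
hypotheses = [f"{a} + {b}*n" for a in range(10) for b in range(10)]
# A "memorizing" program that hard-codes the data and then outputs 123.
hypotheses.append("[2, 4, 6, 8, 123][n]")

def generates(expr: str, data: list) -> bool:
    """Does the expression reproduce every observed value?"""
    return all(eval(expr, {}, {"n": n}) == x for n, x in enumerate(data))

consistent = sorted((h for h in hypotheses if generates(h, observed)), key=len)
best = consistent[0]                                      # shortest consistent "program"
print(best, "->", eval(best, {}, {"n": len(observed)}))   # 2 + 2*n -> 10
```

Both "2 + 2*n" and the memorizing lookup table reproduce the data, but the shorter expression wins, so the prediction is 10 rather than 123.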

Mathematical Formalization of Occam's Razor

"Solomonoff induction" can be viewed as the mathematical and computational formalization of "Occam's razor" (the principle that among multiple hypotheses, one should select the simplest).

While Occam's razor is intuitively understandable, the quantification of "simplicity" remained ambiguous. Solomonoff induction provided mathematical rigor to this principle by defining "simplicity" as "the length of the program that executes that explanation on a computer."

Why It's Being Reevaluated in the LLM Era

So why is this 60-year-old theory receiving attention now? The reason is deeply connected to the success of Large Language Models (LLMs).

The pre-training of LLMs is conducted through the simple task of "predicting the next word" from large volumes of text data. While this might appear to be mere statistical pattern learning, this process can actually be viewed as data compression.

To perform efficient compression, one needs to capture the structures and patterns behind the data. These include "programming language grammar," "physical laws," and "human social norms". The reason LLMs appear to have acquired such advanced knowledge may be a byproduct of efficiently compressing text.

The intuition that "compression" is "generalization" has emerged as the underlying principle behind LLMs' remarkable capabilities. And this idea is surprisingly consistent with the theory that Solomonoff induction proposed decades ago.

Theoretical Foundations of Solomonoff Induction

Let's now look more deeply at the theoretical background of Solomonoff induction. Rather than delving into complex mathematics, I'll focus on the essentials that AI engineers should understand.


Algorithmic Probability and Universal Prior Probability

At the center of Solomonoff induction is the concept of "algorithmic probability". This associates the probability of generating a string or data with the length of the program that can produce it.

Specifically, it is defined as the probability that random bits input to a universal Turing machine (UTM, essentially a computer model) will by chance form a program that outputs the desired data.

A key point of this approach is that shorter programs have a higher probability of being generated by chance. For example, the probability of a 10-bit program being generated by chance is $2^{-10}$, but the probability for a 100-bit program is $2^{-100}$, which is vastly smaller.

This characteristic of "prioritizing shorter programs" creates a bias in Solomonoff induction toward "favoring simpler hypotheses". Remarkably, it has been mathematically proven that this bias toward simplicity maximizes the accuracy of predictions for unknown data [1].
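
A quick numerical illustration of how aggressive this prior is; the program lengths below are arbitrary, made-up values:

```python
# How strongly the 2^{-l(p)} prior favours short programs. The program
# lengths here are hypothetical, illustrative values.
lengths = [10, 20, 50, 100]                    # hypothetical lengths in bits
weights = {l: 2.0 ** -l for l in lengths}
total = sum(weights.values())
for l, w in weights.items():
    print(f"{l:>3}-bit program: prior share = {w / total:.3e}")
# Each extra bit halves a program's weight, so the 10-bit program
# holds essentially all of the probability mass here.
```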

Kolmogorov Complexity and the Essence of Information

A concept closely related to Solomonoff induction is "Kolmogorov complexity", defined as the length of the shortest program that can generate given data. Formally, it is expressed as:

$$K_U(x) = \min_{p: U(p)=x} l(p)$$

Here, $U$ is the reference universal Turing machine, $p$ is a program, $l(p)$ is the length of program $p$, and $U(p)=x$ means that $U$ outputs $x$ when executing $p$.

This value, represented as $K(x)$, serves as a measure of the essential information content or complexity of the data $x$. Consider these examples:

  • The Kolmogorov complexity of aaaaaaaaaaaaaaaaaaaa (20 a's) is low (it can be generated by a program that says "repeat a 20 times")
  • The Kolmogorov complexity of a disordered string q2k9z7r5p3m1l8n6o4 is high (you'd need to essentially hardcode the string itself)
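
Since the shortest program itself cannot be computed, a common practical proxy is an off-the-shelf compressor, which gives a crude upper bound on $K(x)$. A small sketch using Python's zlib (exact byte counts depend on the compressor and its settings):

```python
import zlib

# Kolmogorov complexity is non-computable, but an ordinary compressor
# gives a crude upper-bound proxy: structured data compresses well,
# patternless data does not.
samples = {
    "structured": b"a" * 20,                 # "aaaaaaaaaaaaaaaaaaaa"
    "disordered": b"q2k9z7r5p3m1l8n6o4",
}
for label, s in samples.items():
    out = zlib.compress(s, 9)
    print(f"{label}: {len(s)} bytes -> {len(out)} bytes")
# For strings this short the compressor's fixed header dominates, but the
# gap between the two cases already points in the right direction.
```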

The invariance theorem demonstrates an important property. For different universal Turing machines $U$ and $V$, the following relationship holds:

$$|K_U(x) - K_V(x)| \leq C_{UV}$$

Here, $C_{UV}$ is a constant that depends on $U$ and $V$ but not on $x$. This means that changing the underlying computational model shifts complexity values by at most a fixed constant, so Kolmogorov complexity is essentially machine-independent up to an additive term.

Interestingly, Kolmogorov complexity is provably non-computable. Even so, it remains a powerful concept through theoretical frameworks and practical approximations.

Connection to Bayesian Inference

Solomonoff induction is essentially positioned within the Bayesian inference framework. Let's recall Bayes' theorem:

$$P(H|D) = \frac{P(D|H) \times P(H)}{P(D)}$$

Where each term means:

  • $P(H|D)$: The probability of hypothesis $H$ given data $D$ (posterior probability)
  • $P(D|H)$: The probability of data $D$ being generated under hypothesis $H$ (likelihood)
  • $P(H)$: The prior probability of hypothesis $H$
  • $P(D)$: The marginal probability of data $D$ (normalization constant)

In Solomonoff induction, the prior distribution $P(x)$ is formally defined as:

$$P(x) = \sum_{p: U(p)=x} 2^{-l(p)}$$

Where $U(p)$ is the output when program $p$ is executed, and $l(p)$ is the length of program $p$. The sum runs over all programs that output $x$.

Given a data sequence $x_{1:n} = (x_1, x_2, \ldots, x_n)$, the prediction probability for the next data point $x_{n+1}$ is calculated as:

$$P(x_{n+1}|x_{1:n}) = \frac{P(x_{1:n}, x_{n+1})}{P(x_{1:n})}$$

In this framework, hypothesis $H$ is "a program that generates the data", and its prior probability $P(H)$ is assigned based on the length of the program ($P(H) \propto 2^{-l(H)}$).

The relationship with Kolmogorov complexity $K(x)$ is also important; $P(x)$ and $K(x)$ are related by:

$$-\log_2 P(x) = K(x) + O(1)$$

This means that data with higher probability has lower Kolmogorov complexity (= can be generated by shorter programs).

As a result, among all programs that match the observed data, the shortest one naturally has a higher posterior probability. Using this probability distribution to predict future data is the essence of Solomonoff induction.
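
The following toy sketch mimics this posterior weighting under heavy simplifications: "programs" are just bit patterns repeated forever, and a pattern of length $L$ stands in for an $L$-bit program with prior weight $2^{-L}$, so this is a caricature of the real construction:

```python
from itertools import product

# A heavily simplified, Solomonoff-flavoured predictor. "Programs" are bit
# patterns repeated forever; a pattern of length L stands in for an L-bit
# program with prior weight 2^{-L}. We keep hypotheses consistent with the
# data and predict the next bit by prior-weighted vote. (Real Solomonoff
# induction mixes over all programs on a universal Turing machine.)

def predict_next(data: str, max_len: int = 6) -> dict:
    scores = {"0": 0.0, "1": 0.0}
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            pattern = "".join(bits)
            stream = pattern * (len(data) // L + 2)   # unroll the "program"
            if stream.startswith(data):               # consistent with the data?
                scores[stream[len(data)]] += 2.0 ** -L
    total = scores["0"] + scores["1"]                 # assumes some hypothesis fits
    return {bit: s / total for bit, s in scores.items()}

# The shortest consistent pattern "0011" dominates the mixture, so "0" gets
# roughly twice the weight of "1": the Occam bias at work.
print(predict_next("0011"))   # ~{'0': 0.67, '1': 0.33}
```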

Significance for AI Engineers

Why is this important for AI engineers? It can be summarized in the following points:

  1. Fundamental Principle of Generalization: The essential goal of machine learning is generalization, and Solomonoff induction shows its theoretical limit
  2. Theoretical Basis for "Compression = Learning": Understanding that data compression and learning are essentially the same process provides guidance for algorithm design
  3. Understanding the Role of Bias: A theoretical explanation of why bias toward simplicity leads to good generalization
  4. Recognition of Learning Limitations: Recognizing that complete Solomonoff induction is non-computable, and actual AI systems are merely approximations of it

Compression and Generalization: Solving the Mystery of Unsupervised Learning

The question "Why does unsupervised learning work?" is a fundamental query related to the core of AI. The key to solving this mystery lies in the concept of "compression".

Machine Learning as "Compression"

Machine learning, especially unsupervised learning, can be viewed from the perspective of "data compression". To perform efficient data compression, one needs to capture the structures and patterns behind the data. Consider these examples:

  • Image compression captures visual patterns such as correlations between pixels, edges, and textures
  • Text compression utilizes word frequencies, grammatical structures, and semantic relationships
  • Audio compression leverages frequency characteristics and temporal patterns

Extracting and utilizing these patterns is essentially what "learning" is.

The relationship between compression and learning is bidirectional:

  • Good learning → Enables efficient compression
  • Efficient compression → Implies good pattern recognition (= learning)

This perspective originated with Claude Shannon, the founder of information theory, and continues to influence modern deep learning research [2].

Why Unsupervised Learning Works

Let's consider the mystery of unsupervised learning, particularly LLM pre-training, from the perspective of "compression".

In LLM pre-training, the task of predicting the next token (word or part of a word) is used. For example, predicting "sunny" for the input "Today's weather is ____".

This next-token prediction essentially corresponds to data compression. This is because being able to efficiently predict tokens means capturing the redundancies (patterns) contained in the text [3].

Ilya Sutskever (co-founder and former Chief Scientist of OpenAI) has expressed the view that "compression equals generalization" [4]. According to this idea, a model that can efficiently compress data has captured the essential structure of that data, and can therefore make good predictions (generalize well) on unseen data.


Understanding LLM Pre-training from a "Compression" Perspective

Viewing LLM pre-training as "compression" provides several interesting insights:

  1. Automatic Acquisition of Language Structure: To achieve efficient prediction (= compression), the model naturally learns grammar rules and common-sense knowledge
  2. Necessity of Large-scale Data: Better compression requires observing more patterns
  3. Explanation of Scaling Laws: Increasing model size allows capturing more complex patterns, improving compression (= prediction) performance
  4. Understanding Emergent Abilities: With sufficient compression capability, higher-order abilities (like reasoning) that weren't explicitly trained can be acquired

OpenAI researchers' description of their pre-training as "an attempt to find the shortest program that explains all the data humans have produced" shows they're viewing LLMs from this perspective of Solomonoff induction and compression.

Compression and Prediction: Information-Theoretical Connection

Viewing the relationship between compression and prediction from an information theory perspective provides even deeper understanding. Formally, a good predictor is a good compressor, and vice versa.

According to Shannon's information theory, the expected minimum number of bits required to encode a message following a probability distribution $P$ equals the entropy of that distribution, $H(P)$:

$$H(P) = -\sum_x P(x) \log_2 P(x)$$

When using a probability distribution $Q$ to encode messages from distribution $P$, the expected number of bits required becomes the cross-entropy $H(P,Q)$:

$$H(P,Q) = -\sum_x P(x) \log_2 Q(x)$$

Here, $H(P,Q) \geq H(P)$, with equality holding only when $P=Q$. In other words, prediction distributions closer to the true distribution enable more efficient compression.
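
A quick numerical check of this inequality, using made-up distributions $P$ and $Q$:

```python
import math

# Numeric check of H(P,Q) >= H(P): coding against the wrong model Q
# always costs extra bits (the gap is the KL divergence).
P = {"a": 0.5, "b": 0.25, "c": 0.25}   # true next-symbol distribution
Q = {"a": 0.8, "b": 0.1, "c": 0.1}     # a miscalibrated predictor

H_P = -sum(p * math.log2(p) for p in P.values())
H_PQ = -sum(P[s] * math.log2(Q[s]) for s in P)

print(f"H(P)   = {H_P:.3f} bits/symbol")    # 1.500
print(f"H(P,Q) = {H_PQ:.3f} bits/symbol")   # ~1.822, always >= H(P)
```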

For example, "arithmetic coding" is a compression technique that concretely implements this principle. It performs encoding based on probability predictions for the next symbol, with higher prediction accuracy improving compression ratio. For a sequential data x1,x2,...,xnx_1, x_2, ..., x_n, the code length using arithmetic coding with probability model PP approaches:

$$\text{Code length} \approx -\log_2 P(x_1, x_2, \ldots, x_n) = -\sum_{i=1}^n \log_2 P(x_i | x_1, \ldots, x_{i-1})$$

Conversely, the task "predict the next token" can be considered equivalent to the task "compress the text as much as possible" [5]. This is because a model's loss function (cross-entropy) is expressed as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i} \log P_\theta(x_{i,t} | x_{i,<t})$$

which directly corresponds to the average code length when encoding the data using model $P_\theta$.
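
To see this identity in running code, here is a sketch that computes the ideal code length of a sequence under a Laplace-smoothed character bigram model; the model, the training string, and the smoothing scheme are all illustrative stand-ins for $P_\theta$:

```python
import math
from collections import Counter, defaultdict

# The ideal arithmetic-coding length of a sequence under model P is
# -sum_i log2 P(x_i | context): exactly the (total) cross-entropy loss.

train = "abababababababab"
test = "abababab"
alphabet = sorted(set(train + test))

counts = defaultdict(Counter)
for prev, cur in zip(train, train[1:]):
    counts[prev][cur] += 1

def prob(cur: str, prev: str) -> float:
    # Laplace smoothing keeps unseen transitions at nonzero probability.
    return (counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(alphabet))

code_length = -sum(
    math.log2(prob(cur, prev)) for prev, cur in zip(test, test[1:])
)
print(f"{code_length:.2f} bits for {len(test) - 1} predicted chars")
# A better predictor (lower cross-entropy) would compress to fewer bits.
```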

This is no coincidence but is rooted in the fundamental principles of information theory. The lower the uncertainty (entropy), the less information is required, enabling more efficient compression. Good prediction reduces uncertainty, which directly leads to efficient compression.

Evolution of Solomonoff Induction: From Past to Future

To understand the historical development and modern significance of Solomonoff induction, let's trace its temporal evolution.

Evolution of Solomonoff Induction

1956

Birth of AI

The field of AI is formally founded at the Dartmouth Conference. Ray Solomonoff also participates in this conference.

1960-1964

Proposal of Solomonoff Induction

Ray Solomonoff publishes a series of papers proposing a theory of universal inductive inference.

circa 1965

Development of Kolmogorov Complexity

Andrey Kolmogorov independently develops a measure of complexity (Kolmogorov complexity).

1990s

Popularization of Minimum Description Length Principle

The Minimum Description Length (MDL) principle, related to Solomonoff induction, gains attention in the machine learning field.

circa 2000

Emergence of the AIXI Model

Marcus Hutter presents the AIXI model that integrates Solomonoff induction with reinforcement learning. It attracts attention as a theoretical framework for universal AI.

Late 2010s

Rise of Large Language Models

LLMs rapidly develop with the success of Transformer architecture and large-scale unsupervised pre-training.

2022-2024

Connection Between LLMs and Solomonoff Induction

Discussion linking LLM success to 'compression' and citing Solomonoff induction as a theoretical basis becomes active.

March 2025

Proposal of the Kolmogorov Test

The 'Kolmogorov test', using compression through code generation to evaluate LLM capabilities, is proposed.

History and Development of the Theory

Solomonoff induction was proposed in the early 1960s by Ray Solomonoff (one of the founders of the AI field). He attempted to construct "a completely general and formal theory of inductive reasoning not limited to specific domains" [6].

Interestingly, Solomonoff induction was not a mainstream idea in the early AI community. At that time, AI was primarily centered on symbolic processing approaches, and probabilistic methods didn't receive much attention [7].

Around 2000, Marcus Hutter extended Solomonoff induction and combined it with reinforcement learning to propose the "AIXI model". This was a theoretical model of "universal AI" capable of optimal decision-making in any environment (though non-computable) [8].

Connection with Bayesian Optimization and My Research

As someone who chose Bayesian optimization as the theme of my master's thesis, I have a strong interest in this area. And there are interesting connections between Solomonoff induction and Bayesian optimization.

At the core of Bayesian optimization is "balancing exploration and exploitation": efficiently finding optimal solutions within a limited number of trials. In its theoretical limit, this resonates with Solomonoff induction's process of exploring the space of possible hypotheses and identifying the best explanation.

The three points they share are:

  1. Modeling Prior Knowledge: Representing uncertainty as probability distributions
  2. Evidence-based Updates: Updating distributions based on new data
  3. Bias Toward Simplicity: Prioritizing simpler explanations

These principles can be applied not only to machine learning but also to human decision-making and learning processes. They bridge both aspects of my research (Bayesian optimization and Solomonoff induction).

Connection Points with Modern AI Research

In modern AI research, particularly LLM research, Solomonoff induction has several important connection points:

  1. Theoretical Explanation of Scaling Laws: Performance improvements with increased model size can be explained as the ability to capture more complex patterns (= better compression)
  2. Foundation for Self-supervised Learning: Providing a theoretical explanation of why self-supervised learning like next-token prediction produces deep understanding
  3. Analysis of Emergent Abilities: Explaining the principle by which higher-order abilities not explicitly learned emerge with sufficient compression capacity
  4. New Perspectives on Model Evaluation: Development of new metrics based on the principles of Solomonoff induction, such as "compression through code generation"

A hypothesis currently drawing particular attention is that "the Transformer architecture might be a good approximation of Solomonoff induction" [9]. Its ability to capture long-range dependencies and predict based on complex contexts might be implementing an implicit mixture of many simple "programs".

Practical Applications and Limitations

Let's consider how to apply Solomonoff induction theory to actual AI development and applications, as well as its limitations.


Overcoming the Wall of Non-computability

The greatest challenge of Solomonoff induction is that it is non-computable. This limitation stems mainly from three points:

  1. Non-computability of Kolmogorov Complexity: No algorithm exists that can calculate the true complexity of given data
  2. Infinite Program Space: Need to consider all possible programs
  3. Halting Problem: Cannot generally determine whether a program will halt

These theoretical barriers mean that implementing complete Solomonoff induction is fundamentally impossible. However, this doesn't prevent the development of practical approximation methods.

Actual approximation approaches include:

  • Computational Resource Limitations: Setting upper limits on execution time or length of programs considered (such as approximations of the AIXI model)
  • Hypothesis Space Limitations: Restricting to specific model classes (like neural networks)
  • Practical Compression Algorithms: Using practical compression algorithms instead of theoretical Kolmogorov complexity (see the sketch after this list)
  • Meta-learning Approaches: Training models to approximate the behavior of Solomonoff induction by experiencing diverse prediction tasks
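
The third approach above can be made concrete with the Normalized Compression Distance (NCD), a classic construction that substitutes a real compressor for Kolmogorov complexity. A sketch using zlib; the example strings are arbitrary, and serious applications typically use stronger compressors:

```python
import zlib

# Normalized Compression Distance (NCD): replace Kolmogorov complexity
# with a real compressor to estimate shared structure between objects.
# Values near 0 suggest near-identical information, near 1 unrelated data.

def C(x: bytes) -> int:
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog" * 4
s2 = b"the quick brown fox leaps over the sleepy cat" * 4
noise = bytes(range(256)) * 2              # unrelated byte pattern

print(f"similar texts:  {ncd(s1, s2):.2f}")    # closer to 0
print(f"unrelated data: {ncd(s1, noise):.2f}") # closer to 1
```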

Influence on AI System Design

The perspective of Solomonoff induction can influence actual AI system design in the following ways:

  1. Emphasis on Unsupervised Pre-training: Designs based on compression learning from large volumes of data
  2. Explicit Introduction of Bias Toward Simplicity: Use of regularization or structured architectures
  3. Understanding the Value of Diverse Experience: Importance of learning broad knowledge rather than specific domains
  4. New Metrics for Model Evaluation: Using compression efficiency to evaluate model capabilities

For example, in LLM design, one could consider pre-training methods that explicitly tackle compression tasks rather than mere next-token prediction. Research is also progressing on evaluating how efficiently a model's internal representations compress data [10].

Four points of particular note in the latest research trends from 2024 to early 2025 are:

  1. Kolmogorov Test (KT): A benchmark using "compression through code generation" to evaluate LLM capabilities. It evaluates reasoning and compression abilities by having LLMs generate the shortest program that outputs a given data sequence [11].
  2. Explicit Approximation of Solomonoff Induction: Research attempting to explicitly reproduce the characteristics of Solomonoff induction by training neural networks on universal data [12].
  3. Relationship Between Transformers and Solomonoff Induction: Research theoretically analyzing why the Transformer architecture might be a better approximation of Solomonoff induction than other architectures [13].
  4. Ethical Implications of Prior Probabilities: Discussion on how the universal prior in Solomonoff induction affects AI decision-making and values [14].

My Perspective as an AI Engineer

As an AI engineer rather than a theoretical researcher, my practical insights from Solomonoff induction are as follows:

  1. Utility of the "Compression" Perspective: The perspective of "how efficiently can the model compress data" provides a very practical guideline in designing and evaluating machine learning models

  2. Importance of Bias Toward Simplicity: Provides theoretical grounds for prioritizing simplicity in model design to prevent overfitting and achieve good generalization

  3. Recognition of the Gap Between Theory and Practice: Even if theoretically optimal algorithms are non-computable, practical approaches inspired by their insights are valuable

  4. Balance Between Exploration and Exploitation: The principle of exploring the hypothesis space as widely as possible while concentrating resources on promising hypotheses is practically important even with finite computational resources

Summary and Outlook

We've seen how the seemingly complex concept of "Solomonoff induction" actually provides an important perspective for understanding modern AI, especially unsupervised learning.

The Essence of Unsupervised Learning from a Compression Perspective

One answer to the question "Why does unsupervised learning work?" lies in the concept of "compression". The view that efficient compression necessarily leads to good generalization provides a powerful framework that consistently explains various phenomena in deep learning.

Solomonoff induction gives this idea a theoretical foundation, mathematically formalizing the principle that "finding the shortest program leads to the best prediction". This can be seen as formalizing the intuition that "compression is generalization".

My Perspective as an AI Engineer and Future Directions

As an AI engineer, these are the four points I'd like to focus on in the future:

  1. Research on Architectures that Enhance Compression Efficiency: The perspective of how new architectures beyond Transformers achieve more efficient data compression
  2. Integration of Unsupervised Learning and Reinforcement Learning: Development of frameworks that integrate prediction and decision-making, like the AIXI extension of Solomonoff induction
  3. Models that Perform Efficient Compression Even at Small Scale: Model designs that can perform efficient compression learning even without large computational resources
  4. Standardization of Compression Capability Evaluation Metrics: Development of benchmarks like the Kolmogorov test that directly evaluate models' compression capabilities

The Beginning of an Intellectual Journey

In this article, I've explored the essence and modern significance of "Solomonoff induction", a term I encountered on X. While its theoretical background is profound, the concept rests on intuitive cores like "prioritize shorter programs" and "compression generates generalization", and these can serve as practical guidelines for AI engineers.

The intellectual curiosity journey that begins with encountering unfamiliar terms on X and leads to exploration always brings new discoveries. I hope this journey of capturing the essence of AI from the perspective of "compression" has provided you with some compression (= a new framework of understanding) in your thinking as well.

References

Footnotes

  1. Hutter, M. (2000). "A Theory of Universal Artificial Intelligence based on Algorithmic Complexity." arXiv:cs/0004001.

  2. Shannon, C.E. (1948). "A Mathematical Theory of Communication." The Bell System Technical Journal, 27(3), 379-423.

  3. Brown, T.B., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33.

  4. Sutskever, I. (2022). "How do we get to AGI from here? The role of RL, self-supervised learning, and meta-learning." Stanford HAI Seminar.

  5. Deletang, G., et al. (2023). "Language Modeling Is Compression." arXiv:2309.10668.

  6. Solomonoff, R.J. (1964). "A formal theory of inductive inference. Part I." Information and Control, 7(1), 1-22.

  7. Sterkenburg, T.F. (2016). "Solomonoff Prediction and Occam's Razor." Philosophy of Science, 83(4), 459-479.

  8. Hutter, M. (2000). "A Theory of Universal Artificial Intelligence based on Algorithmic Complexity." arXiv:cs/0004001.

  9. Akrout, M., et al. (2024). "Transformers As Approximations of Solomonoff Induction." arXiv:2408.12065.

  10. Deletang, G., et al. (2023). "Neural Networks as Compression: Insights from Deep Learning's Local Function Approximation." arXiv:2306.14747.

  11. Colby, L., et al. (2025). "The KoLMogorov Test: Compression by Code Generation." arXiv:2503.13992.

  12. Scherlis, A., et al. (2024). "Learning Universal Predictors." arXiv:2401.14953.

  13. Akrout, M., et al. (2024). "Transformers As Approximations of Solomonoff Induction." arXiv:2408.12065.

  14. Garrabrant, S., et al. (2023). "Concerns about the Solomonoff prior." Alignment Forum.

Ryosuke Yoshizaki

CEO, Wadan Inc. / Founder of KIKAGAKU Inc.

I am working on structural transformation of organizational communication with the mission of 'fostering knowledge circulation and driving autonomous value creation.' By utilizing AI technology and social network analysis, I aim to create organizations where creative value is sustainably generated through liberating tacit knowledge and fostering deep dialogue.
