Introduction: The Challenge of Investment Interview Systems
I was developing an AI system designed to help investors evaluate entrepreneurs, the first prototype created through Vibe Coding collaboration. The system analyzes an entrepreneur's PDF materials and then deepens the evaluation through dialogue.
The most challenging aspect of this project was "having AI make subjective evaluations." When we ask AI, "How would you score this entrepreneur's technical skills?", what are we actually expecting? Can we even quantify an abstract concept like "technical skills"? And if we can, how reliable would that evaluation be?
To find answers to these questions, I went through many trials and errors. What I eventually arrived at was a mechanism for uncertainty modeling and updates based on Bayesian statistics. In this article, I want to share the challenges I faced and the solutions I found.
Subjective evaluation by AI might seem contradictory at first, but many systems are already doing this every day. For example, systems that estimate satisfaction from restaurant reviews or evaluate customer support quality. However, performing these evaluations with high reliability is more difficult than you might imagine.
Challenges in Investment Evaluation Systems
In my system, I needed to evaluate entrepreneurs on criteria such as:
- Market Analysis: Understanding of market size, growth rate, and competitive landscape
- Technical Skills: Technical advantages, feasibility, and innovation
- Team Composition: Team diversity, experience, and expertise
- Business Strategy: Business model, profitability, and scalability
- Track Record & Scalability: Past achievements and future growth potential
All of these include "subjective" elements. For instance, "technical skills" might be evaluated on different standards by Silicon Valley investors versus Japanese investors. Even the same investor might apply different levels of strictness depending on the field, whether it's AI or biotechnology.
More importantly, there's the question of how to express uncertainty in evaluations. There's a significant difference between "technical skills are 3 points, but I'm not confident" and "technical skills are definitely 3 points."
Another difficult problem was how to update evaluations when new information is obtained. If we evaluate "technical skills as 3 points" based on a PDF, how should we reflect additional information gained through Q&A?
I designed my AI evaluation system while facing these challenges.
Introducing Rubrics: Reducing Variability in Subjective Evaluations
The first thing I tackled was standardizing evaluation criteria. Rubrics proved to be very effective for this.
A rubric is a tool that sets specific criteria for each evaluation item and assesses achievement levels against those criteria. I learned about the concept while reading a book on educational theory. In education, grades have to be assigned consistently, so there is a need to clearly define how many points each evaluation item is worth.
In my case, I adopted Bloom's Revised Taxonomy as my evaluation criteria. This classifies depth of knowledge into 6 levels:
- Remember: Able to reproduce basic information (1 point)
- Understand: Can accurately explain the meaning of concepts (2 points)
- Apply: Can effectively use learned knowledge in practical situations (3 points)
- Analyze: Can break down information and clearly understand causal relationships (4 points)
- Evaluate: Can make logical judgments and determine validity (5 points)
- Create: Can produce new value or unique solutions (6 points)
Based on these 6 levels, I decided to assign values from 1.0 to 6.0 to each evaluation axis. For example, if "technical skills" were 4.5 points, it would mean the technical understanding and judgment were somewhere between "analyze" and "evaluate."
Using rubrics made my instructions to the AI clearer and significantly reduced the variability in evaluations. Instead of saying "please evaluate technical skills," I could instruct "please evaluate technical skills on a scale of 1.0 to 6.0 based on Bloom's Revised Taxonomy."
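To make this concrete in code, the rubric can be attached directly to the evaluation instruction. The sketch below is only illustrative: the level descriptions mirror the list above, but build_evaluation_prompt and its wording are hypothetical, not the exact prompt my system uses.

BLOOM_RUBRIC = {
    1.0: "Remember: able to reproduce basic information",
    2.0: "Understand: can accurately explain the meaning of concepts",
    3.0: "Apply: can effectively use learned knowledge in practical situations",
    4.0: "Analyze: can break down information and clearly understand causal relationships",
    5.0: "Evaluate: can make logical judgments and determine validity",
    6.0: "Create: can produce new value or unique solutions",
}

def build_evaluation_prompt(axis_name):
    """Build an instruction that ties the 1.0-6.0 scale to the rubric levels"""
    levels = "\n".join(f"{score:.1f}: {desc}" for score, desc in BLOOM_RUBRIC.items())
    return (
        f"Evaluate the entrepreneur's {axis_name} on a scale of 1.0 to 6.0 "
        f"based on Bloom's Revised Taxonomy:\n{levels}"
    )

print(build_evaluation_prompt("technical skills"))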
Considering Methods for Updating Evaluations
When having AI evaluate something, we need to update the evaluation not just once but each time new information comes in. For example, when evaluating an entrepreneur's technical skills, we might first evaluate based on presentation materials, then update our evaluation with information gained through Q&A.
So how should we update these evaluations?
Problems with Simple Update Methods
The first method that came to mind was a simple average. If the previous evaluation was 5 points and a new evaluation based on new information is 4 points, we update to (5+4)/2=4.5 points. This is simple, but there's a problem.
For instance, if after 100 interactions the evaluation of "technical skills is 5 points" has been established, it seems unnatural to give equal weight to a 4-point evaluation from just one new interaction.
To address this problem, I considered a weighted average in which the existing evaluation is weighted by the number of interactions behind it:

$$\text{updated} = \frac{n \times \text{previous} + 1 \times \text{new}}{n + 1}$$

For example, if there have been 100 previous interactions (averaging 5 points) and one new interaction (4 points):

$$\frac{100 \times 5 + 1 \times 4}{101} \approx 4.99$$
This way, evaluations accumulated over a long time won't change dramatically with one piece of new information. However, there's still a problem with this method: not all information has the same reliability.
For example, a confident statement like "this company's technology is truly innovative" and a vague statement like "the technology seems good, though I don't understand the details" shouldn't be weighted the same.
Furthermore, there's the question of how to express uncertainty in evaluations. There's a big difference between "5 points with confidence" and "probably around 5 points."
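Before moving on, here is a minimal numeric sketch contrasting the simple and weighted averages above (the numbers mirror the earlier example: 100 prior interactions averaging 5 points, then one new answer scored 4 points):

previous_mean, n_previous = 5.0, 100
new_value = 4.0

simple_average = (previous_mean + new_value) / 2
weighted_average = (n_previous * previous_mean + new_value) / (n_previous + 1)

print(f"Simple average:   {simple_average:.2f}")    # 4.50 -- one answer carries as much weight as 100 interactions
print(f"Weighted average: {weighted_average:.2f}")  # 4.99 -- barely moves after 100 interactions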
Moving Toward Updates That Consider Uncertainty
From these considerations, I decided to give evaluations two elements:
- Evaluation value (e.g., 5 points)
- Uncertainty (e.g., standard deviation 0.5)
Lower uncertainty means more confidence in the evaluation. For example:
- "5 points with confidence" → Evaluation value 5 points, standard deviation 0.2
- "Probably around 5 points" → Evaluation value 5 points, standard deviation 1.0
So how should we update evaluations with these two elements? That's where I arrived at Bayesian updates.
Foundations of Bayesian Statistics
Interpretations of Probability: Frequentist vs. Subjective
In statistics, there are two major approaches to interpreting probability: the "frequentist" approach and the "Bayesian (subjective)" approach.
The frequentist approach defines probability as "the relative frequency when trials are repeated infinitely." For example, saying that the probability of a coin landing heads is 0.5 means that if you flip the coin infinitely, the proportion of heads will converge to 0.5.
On the other hand, the Bayesian approach interprets probability as "subjective degree of confidence in a proposition." In this interpretation, as data and evidence increase, this subjective degree of confidence (probability) gets updated.
The Bayesian approach felt very intuitive for my evaluation system. This is because investors have a "subjective degree of confidence" about entrepreneurs' abilities and update that confidence based on new information gained through materials and dialogue. In other words, Bayesian statistics models human thought processes closely.
Bayes' Theorem
At the center of Bayesianism is Bayes' theorem. This is a theorem about conditional probability, expressed by the following equation:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where:
- $P(A \mid B)$ is the conditional probability of A given B (posterior probability)
- $P(B \mid A)$ is the conditional probability of B given A (likelihood)
- $P(A)$ is the unconditional probability of A (prior probability)
- $P(B)$ is the unconditional probability of B (marginal likelihood)
Interpreting this theorem in the context of Bayesian updates:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

Where $\theta$ is an evaluation parameter (e.g., "technical skills"), and $x$ is a newly observed value (e.g., "technical skills 4.5 points").
In other words, the probability distribution of the parameter after obtaining a new observation (posterior distribution) equals the probability of obtaining that observation given the parameter (likelihood) multiplied by the probability distribution of the parameter before observation (prior distribution), divided by the marginal likelihood of the observation.
Prior and Posterior Distributions
In the Bayesian update framework, the concepts of "prior distribution" and "posterior distribution" are important.
The prior distribution is the probability distribution of the parameter before obtaining a new observation. This is set based on past experience, expert knowledge, or simple assumptions.
The posterior distribution is the probability distribution of the parameter after obtaining an observation. This is obtained by updating the prior distribution using Bayes' theorem.
In this framework, the posterior distribution obtained based on one data point becomes the prior distribution when observing the next data point. In other words, as data increases, the distribution is gradually updated.
This is the basic idea of Bayesian updates. Next, let's look at how to express and update uncertainty.
Normal Distributions and Bayesian Updates
Why Normal Distributions?
When modeling uncertainty, normal distributions (Gaussian distributions) are particularly useful for several reasons:
- Mathematical tractability: Normal distributions have additivity, meaning the sum of multiple normal distributions is also a normal distribution
- Central limit theorem: The sum of many independent random variables tends toward a normal distribution, regardless of their original distributions
- Information-theoretic properties: Under given mean and variance, the distribution that maximizes entropy (a measure of uncertainty) is the normal distribution
- Computational efficiency: When both the prior distribution and likelihood are normal distributions, the posterior distribution can be analytically derived
In my evaluation system, I decided to express each evaluation axis parameter using a normal distribution. Specifically, I represent parameter $\theta_i$ as a pair of mean and standard deviation $(\mu_i, \sigma_i)$:

$$\theta_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$$

Here, $\mu_i$ represents the estimated value of evaluation axis $i$ (the expected value of the evaluation), and $\sigma_i$ represents its uncertainty (the inverse of confidence). A smaller $\sigma_i$ means higher confidence in the evaluation.
Derivation of Bayesian Updates Between Normal Distributions
The greatest appeal of Bayesian updates using normal distributions is that if both the prior distribution and likelihood are normal distributions, the posterior distribution will also be a normal distribution. Moreover, its mean and variance can be analytically derived.
Let's derive this mathematically. Assume the prior distribution and likelihood are the following normal distributions:

- Prior distribution: $p(\theta) = \mathcal{N}(\theta \mid \mu_0, \sigma_0^2)$
- Likelihood: $p(x \mid \theta) = \mathcal{N}(x \mid \theta, \sigma_{\text{obs}}^2)$

Here, $\theta$ is the true evaluation value, $x$ is the observed value, and $\sigma_{\text{obs}}^2$ is the uncertainty of the observation.

By Bayes' theorem:

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$$

The probability density function of a normal distribution is:

$$\mathcal{N}(z \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)$$

Therefore:

$$p(\theta \mid x) \propto \exp\!\left(-\frac{(x-\theta)^2}{2\sigma_{\text{obs}}^2}\right) \exp\!\left(-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right)$$

Rearranging this with respect to $\theta$ (focusing on the quadratic terms in the exponent):

$$p(\theta \mid x) \propto \exp\!\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma_{\text{obs}}^2}\right)\theta^2 - 2\left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma_{\text{obs}}^2}\right)\theta\right] + C\right)$$

Where $C$ is a term independent of $\theta$.

The exponential part of a normal distribution has the form $-\frac{(\theta - \mu)^2}{2\sigma^2}$. Comparing this with our equation:

$$\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_{\text{obs}}^2}, \qquad \frac{\mu_{\text{post}}}{\sigma_{\text{post}}^2} = \frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma_{\text{obs}}^2}$$

From these equations, we can derive the parameters of the posterior distribution:

$$\sigma_{\text{post}}^2 = \frac{\sigma_0^2\, \sigma_{\text{obs}}^2}{\sigma_0^2 + \sigma_{\text{obs}}^2}, \qquad \mu_{\text{post}} = \frac{\sigma_{\text{obs}}^2\, \mu_0 + \sigma_0^2\, x}{\sigma_0^2 + \sigma_{\text{obs}}^2}$$

This is the general equation for Bayesian updates using normal distributions. Looking at it, you'll notice something interesting: the mean of the posterior distribution is a weighted average of the prior mean $\mu_0$ and the observed value $x$. Moreover, the weights are proportional to the precision (inverse variance) of each distribution.
In other words, information from sources with high uncertainty (large variance) is weighted less, while information from sources with high confidence (small variance) is weighted more. This makes intuitive sense.
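For example, with a hypothetical prior of $\mathcal{N}(3,\ 1^2)$ and a confident observation $x = 5$ with $\sigma_{\text{obs}} = 0.5$:

$$\mu_{\text{post}} = \frac{0.25 \times 3 + 1 \times 5}{1 + 0.25} = 4.6, \qquad \sigma_{\text{post}}^2 = \frac{1 \times 0.25}{1 + 0.25} = 0.2$$

The posterior mean lands much closer to the confident observation than to the prior, precisely because the observation carries the higher precision.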
Practical Bayesian Update Equations
When applying this to an actual evaluation system, the uncertainty of the observation can be defined using the confidence $c$ of the observation (a value from 0 to 1):

$$\sigma_{\text{obs}}^2 = \frac{1}{c}$$

The higher the confidence, the lower the uncertainty. For example, when confidence $c = 1.0$, $\sigma_{\text{obs}}^2 = 1.0$; when confidence $c = 0.5$, $\sigma_{\text{obs}}^2 = 2.0$.

Using this, the Bayesian update equations become:

$$\mu_{\text{new}} = \frac{\sigma_{\text{obs}}^2\, \mu + \sigma^2\, x}{\sigma^2 + \sigma_{\text{obs}}^2}, \qquad \sigma_{\text{new}}^2 = \frac{\sigma^2\, \sigma_{\text{obs}}^2}{\sigma^2 + \sigma_{\text{obs}}^2}$$

Here, $\mu$ and $\sigma$ are the parameters before the update, $x$ is the observed value, and $c$ is the confidence.
Characteristics of Bayesian Updates
This Bayesian update equation has several important characteristics:
- Higher-confidence observations have greater influence: the higher the confidence $c$, the smaller the observation variance $\sigma_{\text{obs}}^2$, and the more the observed value $x$ influences the updated mean
- Uncertainty always decreases: the updated variance $\sigma_{\text{new}}^2$ is always smaller than the prior variance $\sigma^2$
- The higher the current uncertainty, the greater the influence of observations: the larger the current uncertainty $\sigma$, the more the observed value influences the updated mean
These characteristics make intuitive sense. Information with higher confidence has a greater impact on the evaluation, and uncertainty decreases as information accumulates. Also, if you lack confidence in your current evaluation, you'll give more weight to new information.
Implementation Example: Evaluation System Using Bayesian Updates
Here, I'll introduce an implementation example of an evaluation system using Bayesian updates. The basic implementation is simple, but its applications are wide-ranging.
Basic Implementation of Bayesian Updates
First, let's implement the core part of Bayesian updates:
import math

def bayes_update(prior_mean, prior_std, observed_value, confidence):
    """
    Update parameters using Bayesian updates

    Args:
        prior_mean: Mean of the prior distribution
        prior_std: Standard deviation of the prior distribution
        observed_value: Observed value
        confidence: Confidence in the observation (0.0 to 1.0)

    Returns:
        (posterior_mean, posterior_std): Updated mean and standard deviation
    """
    # Calculate observation variance from confidence
    observed_var = 1.0 / max(confidence, 0.001)  # Prevent division by zero

    # Prior distribution variance
    prior_var = prior_std ** 2

    # Calculate posterior distribution variance
    posterior_var = (prior_var * observed_var) / (prior_var + observed_var)
    posterior_std = math.sqrt(posterior_var)

    # Calculate posterior distribution mean
    posterior_mean = (
        (observed_var * prior_mean + prior_var * observed_value) /
        (prior_var + observed_var)
    )

    return posterior_mean, posterior_std
This function takes the parameters of the prior distribution, the observed value, and the confidence in the observation as inputs, and returns the parameters of the posterior distribution.
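As a quick sanity check of the characteristics described in the previous section, you can compare two observations that differ only in their confidence (the numbers here are arbitrary):

# Same prior (mean 3.0, std 1.0) and observed value 5.0, different confidence
print(bayes_update(3.0, 1.0, 5.0, 0.9))  # ~ (3.95, 0.73): mean moves further, std shrinks more
print(bayes_update(3.0, 1.0, 5.0, 0.3))  # ~ (3.46, 0.88): mean moves less, std shrinks less
# In both cases the standard deviation ends up below the prior value of 1.0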
Application to an Actual Evaluation System
To incorporate this into an actual evaluation system, we need a data structure that manages parameters for each evaluation axis:
def initialize_parameters():
    """Initialize evaluation parameters"""
    return {
        'market_analysis': {'mean': 3.0, 'std': 1.0},
        'technical_skills': {'mean': 3.0, 'std': 1.0},
        'team_composition': {'mean': 3.0, 'std': 1.0},
        'business_strategy': {'mean': 3.0, 'std': 1.0},
        'track_record': {'mean': 3.0, 'std': 1.0},
    }
def update_parameter(parameters, axis, observed_value, confidence):
    """
    Update parameters for a specific evaluation axis

    Args:
        parameters: Parameter dictionary
        axis: Name of the evaluation axis to update
        observed_value: Observed value
        confidence: Confidence in the observation (0.0 to 1.0)

    Returns:
        Updated parameter dictionary
    """
    # Copy the nested per-axis dictionaries as well, so the original data is not modified
    updated_parameters = {name: dict(values) for name, values in parameters.items()}

    # Get current parameters
    prior_mean = parameters[axis]['mean']
    prior_std = parameters[axis]['std']

    # Apply Bayesian update
    posterior_mean, posterior_std = bayes_update(
        prior_mean, prior_std, observed_value, confidence
    )

    # Set updated parameters
    updated_parameters[axis]['mean'] = posterior_mean
    updated_parameters[axis]['std'] = posterior_std

    return updated_parameters
Using these functions, the evaluation system operates in the following flow (a short usage sketch follows the list):
- Set initial parameters (initialize_parameters)
- Update parameters each time a new observation is obtained (update_parameter)
- Output the final evaluation results
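Putting these pieces together, a minimal sketch of that flow might look like this (the observed values and confidences are made-up examples, not real interview data):

# 1. Set initial parameters
parameters = initialize_parameters()

# 2. Update parameters each time a new observation is obtained
parameters = update_parameter(parameters, 'technical_skills', 4.5, 0.7)
parameters = update_parameter(parameters, 'market_analysis', 2.5, 0.4)

# 3. Output the final evaluation results
for axis, params in parameters.items():
    print(f"{axis}: mean {params['mean']:.2f}, std {params['std']:.2f}")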
Concrete Example: Evaluating an Entrepreneur's Technical Skills
As an example, let's consider the "technical skills" evaluation axis for an entrepreneur. Suppose the current evaluation state and a new observation are as follows:
- Current state: Mean $\mu = 3.2$, standard deviation $\sigma = 0.8$
- New observation: Evaluation value $x = 4.5$, confidence $c = 0.7$

Applying the Bayesian update equations:

$$\sigma_{\text{obs}}^2 = \frac{1}{0.7} \approx 1.43$$

$$\mu_{\text{new}} = \frac{1.43 \times 3.2 + 0.64 \times 4.5}{0.64 + 1.43} \approx 3.60, \qquad \sigma_{\text{new}} = \sqrt{\frac{0.64 \times 1.43}{0.64 + 1.43}} \approx 0.66$$

So, after the update, the mean is about 3.60 and the standard deviation is about 0.66. The current evaluation of 3.2 was pulled toward the new observation of 4.5, but it didn't move all the way to 4.5. Also, the uncertainty decreased from 0.8 to about 0.66.
# Code for the concrete example
prior_mean, prior_std = 3.2, 0.8
observed_value, confidence = 4.5, 0.7
posterior_mean, posterior_std = bayes_update(
prior_mean, prior_std, observed_value, confidence
)
print(f"Before update: Mean {prior_mean}, Standard deviation {prior_std}")
print(f"Observation: {observed_value}, Confidence: {confidence}")
print(f"After update: Mean {posterior_mean:.2f}, Standard deviation {posterior_std:.2f}")
Running this code would produce output like:
Before update: Mean 3.2, Standard deviation 0.8
Observation: 4.5, Confidence: 0.7
After update: Mean 3.60, Standard deviation 0.66
Effective Question Design: Maximizing Information Gain
Another advantage of an evaluation system using Bayesian updates is that it provides a theoretical foundation for deciding "what to ask next." Let me share the insights I gained from my trials and errors, applying concepts from information theory to design questions that extract the maximum information within limited time.
Question Design to Maximize Information Gain
Time is limited in investment interviews. That's why it's important for each question to extract as much information as possible. I focused on information gain as a concept to measure the "information value" of questions.
First, to quantitatively measure the uncertainty of each evaluation axis, I use the concept of entropy. Entropy is a measure of uncertainty; the larger the value, the higher the uncertainty. The entropy of a normal distribution can be calculated with the following equation:

$$H = \frac{1}{2} \ln\!\left(2\pi e \sigma^2\right)$$

As you can see from this equation, as the standard deviation $\sigma$ increases, the entropy also increases.

Information gain is the expected amount of information obtained from a question, expressed by:

$$IG = H_{\text{before}} - \mathbb{E}\left[H_{\text{after}}\right]$$

Here, $H_{\text{before}}$ is the uncertainty (entropy) before the question, and $\mathbb{E}[H_{\text{after}}]$ is the expected uncertainty after the question. In other words, it's a measure of how much uncertainty is expected to be reduced by asking the question.
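As a minimal sketch of how this can be used (normal_entropy and most_uncertain_axis are illustrative helper names, and the parameter dictionary follows the mean/std structure from the implementation section):

import math

def normal_entropy(std):
    """Entropy (in nats) of a normal distribution with standard deviation std"""
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

def most_uncertain_axis(parameters):
    """Return the evaluation axis whose current estimate has the highest entropy"""
    return max(parameters, key=lambda axis: normal_entropy(parameters[axis]['std']))

# Example: technical_skills still has the widest distribution, so ask about it next
parameters = {
    'market_analysis': {'mean': 4.1, 'std': 0.4},
    'technical_skills': {'mean': 3.2, 'std': 0.9},
}
print(most_uncertain_axis(parameters))  # -> technical_skills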
Trial and Error in Question Design
In my initial approach, I prepared question templates for each evaluation axis like "market analysis" and "technical skills," and chose questions about the axis with the highest uncertainty. However, I discovered major problems with this method:
- Time inefficiency: Each question could only evaluate one axis
- Lack of context: It was difficult to have natural conversations that built on previous answers
- Algorithmic limitations: Evaluating the "quality" of questions solely based on entropy was insufficient
As a more advanced approach, I also considered using PyTorch to optimize the information gain of questions as a latent variable, but decided it wasn't realistic considering the complexity of implementation and maintainability.
Shifting to Multi-Parameter Optimization
After much trial and error, I arrived at a method of "integrating multiple evaluation axes" in question design. This method aims to obtain more information with a single question by combining questions about multiple evaluation axes with high uncertainty.
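As a rough sketch of this idea (pick_axes_to_combine and the question template are hypothetical illustrations, not the exact logic of my system; sorting by standard deviation is equivalent to sorting by entropy, since entropy grows monotonically with the standard deviation):

def pick_axes_to_combine(parameters, top_k=2):
    """Pick the top_k most uncertain axes (largest standard deviation) to cover in one question"""
    ranked = sorted(parameters, key=lambda axis: parameters[axis]['std'], reverse=True)
    return ranked[:top_k]

def build_integrated_question(axes):
    """Compose a single question that touches several evaluation axes at once"""
    topics = " and ".join(axis.replace('_', ' ') for axis in axes)
    return (f"I'd like to ask about {topics}. "
            "Could you share specific examples or numbers that illustrate both?")

# With the parameter dictionary from earlier, this might produce a combined question
# about, say, technical skills and market analysis
print(build_integrated_question(pick_axes_to_combine(parameters)))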
Practical Approaches to Improve Question Quality
I found that incorporating the following elements was effective in improving the quality of questions:
- Strategic integration of multiple evaluation axes
  - Combine correlated axes like "technical skills" and "market analysis"
  - Prioritize combining axes with high uncertainty
- Integration with knowledge graphs
  - Not just numerical evaluations, but utilizing knowledge graphs as described in Knowledge Graph Design & Implementation Guide: Realizing Dialog Systems through Relationship Modeling
  - Include context specific to the evaluation target (industry characteristics, technical terms, etc.) in questions
- Question design considering respondent psychology
  - Empathetic introductions ("I understand about ~. Next, about ~")
  - Explicit requests for specific examples or numbers (avoiding abstract answers)
  - Contextual continuity by referencing previous answers
Examples and Effects of Question Design
Here's an example of an integrated question designed with this approach:
"I understand your company's technology stack. Next, I'd like to ask about technical advantages and market differentiation. How does this core technology create differences from competitors? Specifically, how does it provide technical solutions to customer needs in your target market? If you have specific examples or numbers, please share them."
This question is designed to extract information about both "technical skills" and "market analysis" at once (a sketch of the corresponding two-axis update follows the list below). When I actually used it, compared to single-axis questions:
- Information density improved: Able to update multiple evaluation axes from a single answer
- Conversation naturalness improved: AI dialogue became a "natural interview" rather than a "mechanical questionnaire"
- Interview time reduced: Fewer questions needed to obtain the same amount of information
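For instance, a single answer to the integrated question above can be scored separately on each axis and then fed through two Bayesian updates (the scores and confidences below are made up for illustration):

# Hypothetical scores extracted from one answer to the integrated question
parameters = initialize_parameters()
parameters = update_parameter(parameters, 'technical_skills', 4.5, 0.8)
parameters = update_parameter(parameters, 'market_analysis', 3.8, 0.6)

for axis in ('technical_skills', 'market_analysis'):
    print(f"{axis}: mean {parameters[axis]['mean']:.2f}, std {parameters[axis]['std']:.2f}")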
Practical Use of Entropy
I found that monitoring uncertainty (entropy) is useful not just for question selection but for optimizing the entire interview process.
- Convergence detection: When entropy falls below a certain value, it can be judged that sufficient information has been gathered
- Objective criteria for session ending: End when entropy for all evaluation axes falls below a threshold
- Switching to exploration mode: Move to more exploratory questions once uncertainty about major evaluation axes is resolved
This allowed for optimal use of limited time, improving both the efficiency and quality of interviews.
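A minimal sketch of such a convergence check, assuming the parameter dictionary used throughout this article and an illustrative threshold:

def interview_converged(parameters, std_threshold=0.5):
    """Judge that enough information has been gathered once every axis is confident enough"""
    return all(params['std'] <= std_threshold for params in parameters.values())

if interview_converged(parameters):
    print("All evaluation axes are below the uncertainty threshold -- end the session")
else:
    print("Some axes are still uncertain -- keep asking targeted questions")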
For a deeper understanding of context using knowledge graphs, please refer to Knowledge Graph Design & Implementation Guide: Realizing Dialog Systems through Relationship Modeling.
Results and Insights from Implementation
After actually implementing this evaluation system using Bayesian updates, I gained several interesting results and insights.
- Improved evaluation stability: Compared to simple weighted averages, evaluations converged more stably
- Visualization of uncertainty: Being able to explicitly show confidence in evaluations through standard deviation enabled distinguishing between "certain evaluations" and "uncertain evaluations"
- Efficient information collection: Able to efficiently improve evaluation precision by prioritizing questions about evaluation axes with high uncertainty
- Evaluation transparency: Able to mathematically explain why an evaluation resulted as it did
What was particularly impressive was that even the same "3 points" can have greatly different uncertainties. For example, I could now distinguish between evaluations like "technical skills are 3 points, but responses to technical questions were vague, so uncertainty is high" and "technical skills are 3 points, and they could explain specific implementation methods, so confidence is high."
Also, when delegating evaluation to AI, it's important that evaluation processes can be explained in a way humans can understand. Thanks to its mathematical foundation, the evaluation system using Bayesian updates can clearly explain "why this evaluation resulted." This was very important for enhancing the reliability of AI evaluations.
Summary and Future Directions
Uncertainty modeling through Bayesian updates is a powerful approach to the challenge of quantifying subjective evaluations by AI. It's particularly strong in its ability to handle both evaluation and uncertainty simultaneously, and to update appropriately with new information.
While I applied this method to an investment evaluation system, its range of applications is wide. For example:
- Customer support quality evaluation
- Learner skill assessment
- Skill matching evaluation for recruitment candidates
- Analysis of customer satisfaction or product evaluations
As for future developments, the following extensions are possible:
- Extension to multivariate normal distributions: Modeling correlations between evaluation axes
- Integration with Bayesian networks: Modeling more complex causal relationships
- Combination with active learning: Efficient information collection based on uncertainty
Finally, I want to emphasize that the essence of this method is "not fearing uncertainty." Rather, by explicitly modeling uncertainty and utilizing it, we can build more reliable evaluation systems. This is an important perspective in collaboration between AI and humans.
While the mathematical models and implementation details may seem complex, the basic idea is intuitive. The principle of "giving more weight to information with high confidence and tracking uncertainty quantitatively" can be applied to various evaluation systems. I hope you'll try incorporating it into your own projects.