Evaluating LLMs that grade other LLMs¶

Comparing self-evaluation methods

Introduction¶

The enormous popularity of LLMs in recent years has spurred research into how to align their outputs with correctness and human intent. Ample examples of hallucinating (more correctly, delusional) outputs are now common fodder for online discussions warning us about the immaturity of LLM technology.

One approach to making LLM outputs logically and factually correct is to do prompt engineering that “guides” them. Chain-of-Thought (CoT) is one such approach. Unfortunately, LLMs are also good at producing entirely reasonably sounding CoT outputs that are, in fact, logically wrong.

The incorrectness of CoT outputs has, in turn, spurred recursive research into checking correctness of LLM outputs - can we get an LLM to grade the CoT of another LLM?

Evaluation in unsupervised settings has an “turtles all the way down” problem when it comes to certifying the output of any algorithm that claims to evaluate another one. How do we know the checker of the grader is not, itself, flawed? Algebraic evaluation escapes this infinite chain because it merely estimates sample statistics of correctness - it uses algebraic relations between statistics of agreement and disagreement observed counts to limit the evaluations that are logically consistent with them.

This notebook is going to show an early Proof-of-Concept of how algebraic evaluation can be used to grade a panel of LLMs that are judging the CoT output of another, different, LLM. Before these early, initial results are explained some general caveats.

The purpose of these experiments is not to compare the three different LLMs with each other after one has taken great care to tune them to their maximum possible accuracy. Very little prompt engineering or parameter tuning (temperature, etc) was done. The goal was to see how easy it was for a relative newbie to use three different commercial LLMs and get them to be error-independent on a CoT benchmark.
The computations shown here were done with a pre-v0.3 release of the package (currently at v0.2).

Experimental set-up¶

Benchmark¶

The experimental benchmark used is the multistep-arithmetic collection from the BIG-Bench-Mistake dataset. These are arithmetic problems like the following, $$1((-1 + 7 + 7 + 9) * (-3 * -1 * 3 * 4)) = \, ?$$ And the CoT-trace for the PaLM 2 response to this query is given as,

Thought 1: This equation can be written as “(A * B)”, where A = (-1 + 7 + 7 + 9) and B = (-3 * -1 * 3 * 4).
Thought 2: Let’s calculate A = (-1 + 7 + 7 + 9) = (22).
Thought 3: Let’s calculate B = (-3 * -1 * 3 * 4) = ((-3 * -1) * (3 * 4)) = (3 * 12) = (36).
Thought 4: Then, the final equation is (A * B) = (22 * 36) = 792. So the answer is 792.

Which happens to be a correct answer and a CoT free of reasoning errors. But these are a small portion of the dataset since its goal is to spur research into finding and correcting reasoning errors. For the evaluation carried out here, the 300 problems in the multistep-arithmetic benchmark were pared to 281 by eliminating the subset of Cot-style traces that either had incorrect reasoning but correct answers or correct reasoning but incorrect answers.

LLMs¶

Many problems in AI safety and alignment would be much simpler if we could just know how well we are doing on average. The approach taken here to solve this problem is to create error-independent panels of experts. This means that during a test the experts made errors independently of each other.

There are various engineering designs you can use to encourage error-independence when you have panels of experts. Three easy ones are,

Train them on different data.
Train them on different features.
Use different learning algorithms.

The one we used here is much simpler than this and is more “off-the-shelf” - use LLMs from different commercial vendors. We would expect different companies to do all 3 things in the list above for us. This is born out by an evaluation we carried out with three LLMs,

Gemini Pro
Mistral Large
GPT-4-Turbo

checking the output PaLM 2 CoT traces in the 282 multistep-arithmetic benchmark. Let’s walk through that evaluation now.

Comparing algebraic evaluation with majority voting¶

# We will be comparing two algebraic methods, ntqr and majority voting, with the ground truth
# The ground truth is given by the ntqr.r2.evaluators.SupervisedEvaluation class.
import ntqr
from ntqr.r2.evaluators import SupervisedEvaluation
# If we assume binary classifiers are making error independent during a test, we can
# use the exact, algebraic solution detailed in ntqr.r2.evaluators.ErrorIndependentEvaluation
from ntqr.r2.evaluators import ErrorIndependentEvaluation
# And finally, we import the class that does majority voting evaluations
from ntqr.r2.evaluators import MajorityVotingEvaluation

# Using the three LLMs above, we asked them to tell us if the PaLM 2 CoT-traces were wrong.
# - the 'a' label means 'Yes'
# - the 'b' label means 'No'
# Evaluation is treated as a data streaming problem. We just keep track of the
# counts of times a particular "vote pattern" - their ordered label decisions
# occurred. From this data sketch of the results of the three LLMs grading the PaLM 2
# CoT we can compute the ground truth for their performance, but also
# try to estimate that ground truth when we get to observe only the decision
# and do not know the true label
data_sketch = {
    "a": {
        ("b", "b", "a"): 169,
        ("b", "a", "a"): 24,
        ("a", "b", "b"): 3,
        ("b", "b", "b"): 23,
        ("a", "b", "a"): 16,
        ("a", "a", "a"): 2,
    },
    "b": {
        ("b", "b", "b"): 17,
        ("b", "b", "a"): 20,
        ("b", "a", "b"): 1,
        ("a", "b", "b"): 3,
        ("a", "b", "a"): 3,
    },
}

# Let's do the supervised evaluation (ground truth) for this experiment
from pprint import pp
s_eval = SupervisedEvaluation(ntqr.r2.datasketches.TrioLabelVoteCounts(data_sketch))
pp(s_eval.evaluation_exact)

It is a little bit hard to rank fractions so let’s turn these exact ground truth values to floats.

pp(s_eval.evaluation_float)

{'prevalence': {'a': 0.8434163701067615, 'b': 0.15658362989323843},
 'accuracy': [{'a': 0.08860759493670886, 'b': 0.8636363636363636},
              {'a': 0.10970464135021098, 'b': 0.9772727272727273},
              {'a': 0.890295358649789, 'b': 0.4772727272727273}],
 'pair_correlation': {(0, 1): {'a': -0.0012818458580355712,
                               'b': -0.0030991735537190084},
                      (0, 2): {'a': -0.002937563424664851,
                               'b': -0.0030991735537190084},
                      (1, 2): {'a': 0.012035108333778419,
                               'b': -0.011880165289256199}}}

There are some things to note:

Two of the LLMs are doing a bad job on one of the labels - recognizing that the PaLM CoT was wrong. That is the ‘a’ label and Gemini Pro and Mistral are not recognizing the PaLM 2 mistakes.
But GPT-4-Turbo, which does well on finding the wrong Cot-traces, does badly on the correct ones.
And finally, success! We set out to create a test where the experts were uncorrelated. And by using different LLM vendors, we were able to do it. The largest pair correlation is about 1.2% for the ‘a’ label and classifiers (1,2).

Now comes the interesting part of the experiment - does having error indendent, or nearly error independent, LLMs help us in evaluating them in an unsupervised setting?

# In a unsupervised setting, we don't get to see the voting counts by true label
# so we need to 'project' the ground truth to what we would observe
voting_counts = ntqr.r2.datasketches.TrioLabelVoteCounts(data_sketch).to_TrioVoteCounts()
pp(voting_counts)

TrioVoteCounts(vote_counts={('a', 'a', 'a'): 2,
                            ('a', 'a', 'b'): 0,
                            ('a', 'b', 'a'): 19,
                            ('a', 'b', 'b'): 6,
                            ('b', 'a', 'a'): 24,
                            ('b', 'a', 'b'): 1,
                            ('b', 'b', 'a'): 189,
                            ('b', 'b', 'b'): 40})

# We can now calculate the error-independent evaluations and the majority voting one.
ae_eval = ErrorIndependentEvaluation(voting_counts)
mv_eval = MajorityVotingEvaluation(voting_counts)

# Let's compare the prevalence estimates. Remember that most of the
# PaLM CoT-traces are wrong - the 'a' label has a prevalence of 84.3%
# on the test
print("The two solutions from the erro-independent model:")
pp([sol["prevalence"]["a"] for sol in ae_eval.evaluation_float])
print("The two solutions from the majority voting evaluation - majority is always right or always wrong")
pp([sol["prevalence"]["a"] for sol in mv_eval.evaluation_float])

So the ‘a’ label prevalence is estimated to be 84.0% by MV, a value very close to the true value of 84.3%. The error-independent estimate is off at 77.6%. How about the label accuracies for the three grading LLMs?

%matplotlib inline
from ntqr.r2.plots import compare_evaluations, plot_evaluations

size=40
seval_exact = s_eval.evaluation_exact
ae_eval_exact = ae_eval.evaluation_exact
mv_eval_exact = mv_eval.evaluation_exact

# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","white",size],
    ["error independent",
     [d['a']*100 for d in ae_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in ae_eval_exact[sol]["accuracy"]],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[sol]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],figsize=(10,5),legend_loc="best",withArrows=False)

%matplotlib inline
from ntqr.r2.plots import compare_evaluations, plot_evaluations

size=40
seval_exact = s_eval.evaluation_exact
ae_eval_exact = ae_eval.evaluation_exact
mv_eval_exact = mv_eval.evaluation_exact

# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [d['a']*100 for d in ae_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in ae_eval_exact[sol]["accuracy"]],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],figsize=(10,5),legend_loc="lower left",withArrows=False)

The error-independent seems to be very good for classifiers 1 and 3. But what happened to the estimate for classifier 2? Let’s check.

# Looking at the error-independent evaluation results
["error independent",
     [float(d['a'])*100 for d in ae_eval_exact[sol]["accuracy"]],
     [float(d['b'])*100 for d in ae_eval_exact[sol]["accuracy"]]]

['error independent',
 [8.00451496620146, 13.40896693859222, 92.76370786462849],
 [84.8263773143796, 103.57683264602629, 49.64993540953866]]

So the error-independent solution thinks the 2nd classifier has a greater than 100% accuracy (103.6%) on label ‘b’!

The beauty of knowing you are wrong¶

It may sound strange but the failure above by the error-independent model is actually a ‘feature’ not a ‘bug’. It goes back to the type of numbers that are returned by ntqr.r2.MajorityVotingEvaluation and ntqr.r2.ErrorIndependentEvaluation. Majority voting will always return rational number estimates. But the error indendent solution may not. In the example we are looking at here, it returns irrational algebraic numbers.

# The algebraic solution from the error-independent evaluation
ae_eval_exact[0]["prevalence"]["a"]

# Majority voting never returns anything but rationals
mv_eval_exact[0]["prevalence"]["a"]

../_images/388a55e076cb43fd145e6fb625bf23d708ba475df88e1b4bb748437fa676e94a.png

Not only is Majority Voting always going to return rational estimates, it will also return estimates logically consistent with the observed responses from the classifiers. We can check this by looking to see if the Majority Voting estimates satisfy the single classifier axioms.

Majority voting always gives logically consistent estimates¶

# Let's import the single binary classifier axiom
from ntqr.r2.paxioms import single_binary_classifier_axiom, pa, pb, pia, pib, fai, fbi
single_binary_classifier_axiom

# Let's check that the axiom is identically zero with the Majority Voting solution
# for all three classifiers
sol = mv_eval_exact[0]
for classifier in range(3):
    vals = {pa: sol["prevalence"]["a"], pb: sol["prevalence"]["b"],
            pia: sol["accuracy"][classifier]["a"], pib: sol["accuracy"][classifier]["b"],
            fai: voting_counts.classifier_label_frequency(classifier,"a"),
            fbi: voting_counts.classifier_label_frequency(classifier,"b"),}
    pp("classifier {}: {}".format(classifier, single_binary_classifier_axiom.subs(vals)))

'classifier 0: 0'
'classifier 1: 0'
'classifier 2: 0'

# The second MV solution is when we assume the majority is always wrong
sol = mv_eval_exact[1]
for classifier in range(3):
    vals = {pa: sol["prevalence"]["a"], pb: sol["prevalence"]["b"],
            pia: sol["accuracy"][classifier]["a"], pib: sol["accuracy"][classifier]["b"],
            fai: voting_counts.classifier_label_frequency(classifier,"a"),
            fbi: voting_counts.classifier_label_frequency(classifier,"b"),}
    pp("classifier {}: {}".format(classifier, single_binary_classifier_axiom.subs(vals)))

'classifier 0: 0'
'classifier 1: 0'
'classifier 2: 0'

So there is nothing we can do to improve the majority voting solution in an unsupervised setting. It always returns rational estimates and they are always logically consistent with the observed disagreements and agreements between the classifiers.

The estimates returned by ntqr.r2.ErrorIndependentEvaluation are not like that. They actually signal that its estimates cannot possibly be right as evidenced by those pesky unresolved square root factors in the estimates. Let’s check whether the error-independent estimates obey the single binary classifier axiom.

Does ntqr.ErrorIndependentEvaluation yield logically inconsistent estimates?¶

# The strength of the ErrorIndependent evaluation is that it can detect
# its assumption - the classifiers are error-independent in the test -
# is violated. We can see this in the unresolved square root
# But does the error-independent solution obey the single binary
# classifier axiom?
sol = ae_eval_exact[1]
for classifier in range(3):
    vals = {pa: sol["prevalence"]["a"], pb: sol["prevalence"]["b"],
            pia: sol["accuracy"][classifier]["a"], pib: sol["accuracy"][classifier]["b"],
            fai: voting_counts.classifier_label_frequency(classifier,"a"),
            fbi: voting_counts.classifier_label_frequency(classifier,"b"),}
    simpfl_axiom_val = sympy.simplify(single_binary_classifier_axiom.subs(vals))
    pp("classifier {}: {}".format(classifier, simpfl_axiom_val))
    pp("classifier {}: {}".format(classifier, float(single_binary_classifier_axiom.subs(vals))))

'classifier 0: 0'
'classifier 0: 1.8465957235571472e-127'
'classifier 1: 0'
'classifier 1: 3.6931914471142943e-127'
'classifier 2: 0'
'classifier 2: 1.4772765788457177e-126'

So the irrational estimates returned by the error-independent solution obey the single binary classifier axiom. Nonetheless, we know it must be wrong since any finite test can only have rational numbers as correct estimates. So let’s find the nearest rational evaluation given the observed data sketch.

Finding the nearest logically consistent solution to the error-independent estimate¶

# The ntqr.r2.evaluations.PosteriorSingleEvaluations helps find
# the logically consistent solutions nearest to the ones returned by
# ntqr.r2.ErrorIndependentEvaluation
from ntqr.r2.evaluations import PosteriorSingleEvaluations
pevals = PosteriorSingleEvaluations(voting_counts)

# Let's find the closest solutions to each classifier by itself.
sol = ae_eval_exact[1]
size_of_test = 281
for classifier in range(3):
    # We scale the points
    
    ae_eval_classifier = (sol["prevalence"]["a"],
                          sympy.simplify(sol["accuracy"][classifier]["a"]),
                          sympy.simplify(sol["accuracy"][classifier]["b"]))
    print("Classifier {}:".format(classifier))
    # Collect the min distance at each qa
    min_distance_evals = []
    for qa in range(200,240):
        qa_classifier_evals = pevals.find_k_nearest_at_prevalence(classifier, qa, ae_eval_classifier, 3)
        nearest_eval = qa_classifier_evals[0]
        min_distance_evals.append((nearest_eval[0][0]+nearest_eval[0][1], (qa, nearest_eval[1])))
    min_distance_evals.sort()
    pp([(float(distance),vals) for distance, vals in min_distance_evals[:2]])

Classifier 0:
[(3.941162729658634e-05, (217, (17, 54))),
 (5.338280039733239e-05, (218, (17, 53)))]
Classifier 1:
[(0.0013843679500308574, (218, (27, 63))),
 (0.001388932455474871, (217, (27, 64)))]
Classifier 2:
[(2.0950709169504348e-05, (218, (202, 31))),
 (2.2354493397940512e-05, (219, (203, 31)))]

So we must consider all classifiers at once since doing closest individual distance leads to contradictory estimates for the ‘a’ label prevalence.

# The closest solution for all classifiers at once
sol = ae_eval_exact[1]
size_of_test = 281
qa_estimate = size_of_test*sol["prevalence"]["a"]
ae_points = [(sol["prevalence"]["a"],
              sol["accuracy"][classifier]["a"],
              sol["accuracy"][classifier]["b"])
             for classifier in range(3)]
for qa in range(215,220):
    logical_evals = pevals.find_k_nearest_at_prevalence_all_classifiers(qa, ae_points, 5)
    closest_eval = logical_evals[0]
    print("qa: {}, distance: {}, point: {}".format(qa, float(closest_eval[0]), closest_eval[1]))

qa: 215, distance: 0.0015415001921062165, point: ((17, 56), (27, 66), (201, 33))
qa: 216, distance: 0.0014517201408929684, point: ((17, 55), (27, 65), (201, 32))
qa: 217, distance: 0.001434953069518205, point: ((17, 54), (27, 64), (202, 32))
qa: 218, distance: 0.0014582820752696278, point: ((17, 53), (27, 63), (202, 31))
qa: 219, distance: 0.0014662486510221135, point: ((18, 53), (27, 62), (203, 31))

# We keep the closest solution at qa=217
nearest_logical_responses = ((17, 54), (27, 64), (202, 32))

size=40
qa = 217
qb = 281-qa
# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
suptitle="Comparing methods to self-evaluate LLMs grading PaLM 2 CoT-traces\n1: Gemini Pro, 2: Mistral Medium, 3: GPT-4-Turbo"
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [raa/qa*100 for raa, rbb in nearest_logical_responses],
     [rbb/qb*100 for raa, rbb in nearest_logical_responses],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],
    figsize=(10,5),legend_loc="lower left",withArrows=False, suptitle=suptitle)

../_images/ea0186648b1cbb24f8bd50ce32d2f23bac02e10231b0c321aa8fc8ebc0f630b2.png

From the point of view of AI safety majority voting is making an egregious mistake with the evaluation of the 2nd LLM here - Mistral Medium. It thinks, incorrectly, that it is above 50% on both labels. Note that that the error-independent solution (corrected by projecting to the nearest logical one) does a much better job of telling us that all three LLMs are doing badly (worse than random) on one label or the other.

Spoofed or flipped data sketches¶

One way to understand the difference between ntqr.r2.ErrorIndependentEvaluatino and ntqr.r2.MajorityVotingEvaluation is to create spoofed or altered data sketches. For example, in our example here we know that classifiers 1 and 2 are doing badly on identifying PaLM 2 CoT-traces that have errors. Since we have access to the ground truth, we can simulate someone spoofing the data sketch by just flipping the decisions on the label ‘a’ instances.

In the following cells we are going to create these spoofed data sketches so as to place the LLMs in the different quadrants (A, B, C, D) of the performance comparison plots.

f12a_tlvc = ntqr.r2.datasketches.TrioLabelVoteCounts(data_sketch).flip_classifiers_label_decisions((0,1),'a')
pp(f12a_tlvc)

TrioLabelVoteCounts(label_vote_counts={'a': {('a', 'a', 'a'): 169,
                                             ('a', 'a', 'b'): 23,
                                             ('a', 'b', 'a'): 24,
                                             ('a', 'b', 'b'): 0,
                                             ('b', 'a', 'a'): 16,
                                             ('b', 'a', 'b'): 3,
                                             ('b', 'b', 'a'): 2,
                                             ('b', 'b', 'b'): 0},
                                       'b': {('a', 'a', 'a'): 0,
                                             ('a', 'a', 'b'): 0,
                                             ('a', 'b', 'a'): 3,
                                             ('a', 'b', 'b'): 3,
                                             ('b', 'a', 'a'): 0,
                                             ('b', 'a', 'b'): 1,
                                             ('b', 'b', 'a'): 20,
                                             ('b', 'b', 'b'): 17}})

pp(data_sketch)

{'a': {('b', 'b', 'a'): 169,
       ('b', 'a', 'a'): 24,
       ('a', 'b', 'b'): 3,
       ('b', 'b', 'b'): 23,
       ('a', 'b', 'a'): 16,
       ('a', 'a', 'a'): 2},
 'b': {('b', 'b', 'b'): 17,
       ('b', 'b', 'a'): 20,
       ('b', 'a', 'b'): 1,
       ('a', 'b', 'b'): 3,
       ('a', 'b', 'a'): 3}}

%matplotlib inline
from ntqr.r2.plots import compare_evaluations, plot_evaluations

size=40
seval_exact = ntqr.r2.evaluators.SupervisedEvaluation(f12a_tlvc).evaluation_exact
ae_eval_exact = ntqr.r2.evaluators.ErrorIndependentEvaluation(f12a_tlvc.to_TrioVoteCounts()).evaluation_exact
mv_eval_exact = ntqr.r2.evaluators.MajorityVotingEvaluation(f12a_tlvc.to_TrioVoteCounts()).evaluation_exact

# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [d['a']*100 for d in ae_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in ae_eval_exact[sol]["accuracy"]],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],figsize=(10,5),legend_loc="best",withArrows=False)

# And now moving all the classifiers to quadrant D
f12a12b_tlvc = f12a_tlvc.flip_classifiers_label_decisions((0,1),'b')
pp(f12a12b_tlvc)

TrioLabelVoteCounts(label_vote_counts={'a': {('a', 'a', 'a'): 169,
                                             ('a', 'a', 'b'): 23,
                                             ('a', 'b', 'a'): 24,
                                             ('a', 'b', 'b'): 0,
                                             ('b', 'a', 'a'): 16,
                                             ('b', 'a', 'b'): 3,
                                             ('b', 'b', 'a'): 2,
                                             ('b', 'b', 'b'): 0},
                                       'b': {('a', 'a', 'a'): 20,
                                             ('a', 'a', 'b'): 17,
                                             ('a', 'b', 'a'): 0,
                                             ('a', 'b', 'b'): 1,
                                             ('b', 'a', 'a'): 3,
                                             ('b', 'a', 'b'): 3,
                                             ('b', 'b', 'a'): 0,
                                             ('b', 'b', 'b'): 0}})

%matplotlib inline
from ntqr.r2.plots import compare_evaluations, plot_evaluations

size=40
curr_tlvc = f12a12b_tlvc
seval_exact = ntqr.r2.evaluators.SupervisedEvaluation(curr_tlvc).evaluation_exact
ae_eval_exact = ntqr.r2.evaluators.ErrorIndependentEvaluation(curr_tlvc.to_TrioVoteCounts()).evaluation_exact
mv_eval_exact = ntqr.r2.evaluators.MajorityVotingEvaluation(curr_tlvc.to_TrioVoteCounts()).evaluation_exact

# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [d['a']*100 for d in ae_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in ae_eval_exact[sol]["accuracy"]],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],figsize=(10,5),legend_loc="lower left",withArrows=False)

ntqr.r2.evaluators.ErrorIndependentEvaluation(curr_tlvc.to_TrioVoteCounts()).evaluation_float

[{'prevalence': {'a': 0.22374136668648772, 'b': 0.7762586333135123},
  'accuracy': [{'a': 0.848263773143796, 'b': 0.0800451496620146},
   {'a': 1.0357683264602628, 'b': 0.1340896693859222},
   {'a': 0.5035006459046134, 'b': 0.07236292135371519}]},
 {'prevalence': {'a': 0.7762586333135123, 'b': 0.22374136668648772},
  'accuracy': [{'a': 0.9199548503379854, 'b': 0.15173622685620403},
   {'a': 0.8659103306140777, 'b': -0.03576832646026274},
   {'a': 0.9276370786462849, 'b': 0.4964993540953866}]}]

ntqr.r2.evaluators.SupervisedEvaluation(curr_tlvc).evaluation_float

{'prevalence': {'a': 0.8434163701067615, 'b': 0.15658362989323843},
 'accuracy': [{'a': 0.9113924050632911, 'b': 0.13636363636363635},
  {'a': 0.890295358649789, 'b': 0.022727272727272728},
  {'a': 0.890295358649789, 'b': 0.4772727272727273}],
 'pair_correlation': {(0, 1): {'a': -0.0012818458580355712,
   'b': -0.0030991735537190084},
  (0, 2): {'a': 0.002937563424664851, 'b': 0.0030991735537190084},
  (1, 2): {'a': -0.012035108333778419, 'b': 0.011880165289256199}}}

# The closest solution for all classifiers at once
curr_tlvc = f12a12b_tlvc
pevals = PosteriorSingleEvaluations(curr_tlvc.to_TrioVoteCounts())
sol = ae_eval_exact[1]
size_of_test = 281
qa_estimate = size_of_test*sol["prevalence"]["a"]
ae_points = [(sol["prevalence"]["a"],
              sol["accuracy"][classifier]["a"],
              sol["accuracy"][classifier]["b"])
             for classifier in range(3)]
for qa in range(210,220):
    logical_evals = pevals.find_k_nearest_at_prevalence_all_classifiers(qa, ae_points, 5)
    closest_eval = logical_evals[0]
    print("qa: {}, distance: {}, point: {}".format(qa, float(closest_eval[0]), closest_eval[1]))

qa: 210, distance: 0.0024158837281476656, point: ((194, 11), (183, 0), (198, 35))
qa: 211, distance: 0.0022394220649198833, point: ((195, 11), (184, 0), (198, 34))
qa: 212, distance: 0.0019827070981761168, point: ((195, 10), (185, 0), (199, 34))
qa: 213, distance: 0.00182876389353067, point: ((196, 10), (186, 0), (200, 34))
qa: 214, distance: 0.0016284247866806291, point: ((197, 10), (187, 0), (200, 33))
qa: 215, distance: 0.0015415001921062165, point: ((198, 10), (188, 0), (201, 33))
qa: 216, distance: 0.0014517201408929684, point: ((199, 10), (189, 0), (201, 32))
qa: 217, distance: 0.001434953069518205, point: ((200, 10), (190, 0), (202, 32))
qa: 218, distance: 0.0014582820752696278, point: ((201, 10), (191, 0), (202, 31))
qa: 219, distance: 0.0014662486510221135, point: ((201, 9), (192, 0), (203, 31))

# We keep the closest solution at qa=217
nearest_logical_responses = ((200, 10), (190, 0), (202, 32))

size=40
qa = 217
qb = 281-qa
# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
suptitle="Comparing methods to self-evaluate LLMs grading PaLM 2 CoT-traces\n1: Gemini Pro, 2: Mistral Medium, 3: GPT-4-Turbo"
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [raa/qa*100 for raa, rbb in nearest_logical_responses],
     [rbb/qb*100 for raa, rbb in nearest_logical_responses],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],
    figsize=(10,5),legend_loc="lower left",withArrows=False, suptitle=suptitle)

../_images/d24a618f45e0d6ebb54c45a96de60523184ce9790f439a6797b67d42448b2f9c.png

Majority voting really struggles in this case. All three LLMs are in quadrant D, but MV thinks 1,3 are healthy as they are in quadrant A. Let’s confirm that the problem is related to which quadrant the classifiers occupy by placing all the LLMs in quadrant A.

Majority voting works best when all classifiers are in quadrants A or C¶

f12a_tlvc = ntqr.r2.datasketches.TrioLabelVoteCounts(data_sketch).flip_classifiers_label_decisions((0,1),'a')
f12a3b_tlvc = f12a_tlvc.flip_classifiers_label_decisions((2,),'b')
pp(f12a3b_tlvc)

TrioLabelVoteCounts(label_vote_counts={'a': {('a', 'a', 'a'): 169,
                                             ('a', 'a', 'b'): 23,
                                             ('a', 'b', 'a'): 24,
                                             ('a', 'b', 'b'): 0,
                                             ('b', 'a', 'a'): 16,
                                             ('b', 'a', 'b'): 3,
                                             ('b', 'b', 'a'): 2,
                                             ('b', 'b', 'b'): 0},
                                       'b': {('a', 'a', 'a'): 0,
                                             ('a', 'a', 'b'): 0,
                                             ('a', 'b', 'a'): 3,
                                             ('a', 'b', 'b'): 3,
                                             ('b', 'a', 'a'): 1,
                                             ('b', 'a', 'b'): 0,
                                             ('b', 'b', 'a'): 17,
                                             ('b', 'b', 'b'): 20}})

%matplotlib inline
from ntqr.r2.plots import compare_evaluations, plot_evaluations

size=40
curr_tlvc = f12a3b_tlvc
seval_exact = ntqr.r2.evaluators.SupervisedEvaluation(curr_tlvc).evaluation_exact
ae_eval_exact = ntqr.r2.evaluators.ErrorIndependentEvaluation(curr_tlvc.to_TrioVoteCounts()).evaluation_exact
mv_eval_exact = ntqr.r2.evaluators.MajorityVotingEvaluation(curr_tlvc.to_TrioVoteCounts()).evaluation_exact

# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [d['a']*100 for d in ae_eval_exact[sol]["accuracy"]],
     [d['b']*100 for d in ae_eval_exact[sol]["accuracy"]],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],figsize=(10,5),legend_loc="lower left",withArrows=False)

# The closest solution for all classifiers at once
curr_tlvc = f12a3b_tlvc
pevals = PosteriorSingleEvaluations(curr_tlvc.to_TrioVoteCounts())
sol = ae_eval_exact[1]
size_of_test = 281
qa_estimate = size_of_test*sol["prevalence"]["a"]
ae_points = [(sol["prevalence"]["a"],
              sol["accuracy"][classifier]["a"],
              sol["accuracy"][classifier]["b"])
             for classifier in range(3)]
for qa in range(239,250):
    logical_evals = pevals.find_k_nearest_at_prevalence_all_classifiers(qa, ae_points, 5)
    closest_eval = logical_evals[0]
    print("qa: {}, distance: {}, point: {}".format(qa, float(closest_eval[0]), closest_eval[1]))

qa: 239, distance: 0.0029988157263349077, point: ((222, 42), (210, 40), (213, 23))
qa: 240, distance: 0.0026547300521559186, point: ((222, 41), (210, 39), (213, 22))
qa: 241, distance: 0.0023142016680816656, point: ((222, 40), (210, 38), (214, 22))
qa: 242, distance: 0.002123637704796693, point: ((222, 39), (210, 37), (214, 21))
qa: 243, distance: 0.0019794640700376504, point: ((222, 38), (211, 37), (215, 21))
qa: 244, distance: 0.0018392651837432657, point: ((222, 37), (211, 36), (215, 20))
qa: 245, distance: 0.0018740634829720292, point: ((222, 36), (211, 35), (216, 20))
qa: 246, distance: 0.0018454109792019893, point: ((222, 35), (211, 34), (216, 19))
qa: 247, distance: 0.0021240660222634567, point: ((222, 34), (211, 33), (217, 19))
qa: 248, distance: 0.0021861702681115693, point: ((222, 33), (211, 32), (217, 18))
qa: 249, distance: 0.0027286780118859653, point: ((222, 32), (211, 31), (217, 17))

# We keep the closest solution at qa=244
nearest_logical_responses = ((222, 37), (211, 36), (215, 20))

size=40
qa = 244
qb = 281-qa
# We know to pick the 2nd solution as the one closest
# to ground truth
sol = 1
suptitle=("Comparing methods to self-evaluate LLMs grading PaLM 2 CoT-traces",
          "LLM outputs spoofed to put them all in quadrant A",
          "1: Gemini Pro, 2: Mistral Medium, 3: GPT-4-Turbo")
compare_evaluations([
    ["ground truth", 
     [d['a']*100 for d in seval_exact["accuracy"]],
     [d['b']*100 for d in seval_exact["accuracy"]],
      "x","b",size],
    ["error independent",
     [raa/qa*100 for raa, rbb in nearest_logical_responses],
     [rbb/qb*100 for raa, rbb in nearest_logical_responses],
      "^", "k",size],
    ["majority voting",
     [d['a']*100 for d in mv_eval_exact[0]["accuracy"]],
     [d['b']*100 for d in mv_eval_exact[0]["accuracy"]],
      "o", "k",size],],0,
    titles=["Ground Truth","Error Independent", "Majority Voting"],
    figsize=(10,5),legend_loc="lower left",withArrows=False, suptitle="\n".join(suptitle))

../_images/9cc6e20babad0e20d1946ab1651bcc1f504c63fe5b8842f1895f4b9b1a70cbe2.png