The Algebra and Geometry of the Axioms of Unsupervised Evaluation

import itertools, random, collections
import sympy
from IPython.display import display,Math,Latex, HTML

import ntqr
import ntqr.raxioms

Hide code cell source

%%capture
sympy.init_session(quiet=True, auto_symbols=False);
sympy.init_printing(print_builtin=False, use_latex='mathjax', latex_mode='plain', order='old') 

Introduction

This notebook explains the algebra and geometry of the axioms of unsupervised evaluation for N test-takers that answer Q questions that have R choices. This new version greatly simplifies what was an unnecessarily complicated presentation in earlier versions of the NTQR package. The geometry is best explained by detailing the evaluation model the logic uses. This is basically a counting model. Unknown ground truth values are integer counts. And we have two sets of these ground truth variables in the evaluation model. The first are the variables that summarize the test as a tuple of the label counts in it.

The answer key Q-simplex

The logic assumes that we have an unknown answer key for the classification test we have given. If such an answer key exists, we can summarize as a tuple of the label counts in it. The tuples of the form \((Q_{\ell_1}, Q_{\ell_2}, \ldots, Q_{\ell_R}).\) These tuples lie on a simplex in an \(R\)-dimensional space because they must satisfy the equation,

(1)\[\begin{equation} \sum_{\ell \in \mathcal{L}} Q_\ell = Q. \end{equation}\]

Denoting the space of the integers from 0 to \(Q\) by \(\mathbb{Q}_Q,\) the Q-simplex lies inside the hypercube of integers denoted by \(\mathbb{Q}_Q^R.\)

All statements about logical consistency between test-taker evaluation occur at a single point in the Q-simplex. It makes no sense to compare evaluations at different values of the Q-simplex since any given test can only occupy one point in the Q-simplex. Consequently, the logic works in a ladder-like manner. Once we pick a value for the Q-tuple, we can find all single classifier evaluations consistent with individual response statistics. These are the \(M=1\) evaluation axioms. This is the next step in the ladder of constructing the evaluation space for \(N\) test-takers.

Note an important property of this counting representation of the answer key – its completeness. If a test has an answer key, it will always lie on some unknown point in the Q-simplex. This is the fundamental basis for building no-knowledge algorithms - we can ask if any property is obeyed for all points in the Q-simplex, if it is, it must be true for the given test. Completeness allows us to always “trap” the ground truth answer key.

This completeness comes at a price. We have stripped our representation of the answer key of any information about the sequence of classifications. Different orders of the same items in the test all map into the same point the Q-simplex. No worries. Once you understand how to use this logic, you can use it to create a logic for sequences of any finite size in the test. That is an advanced topic.

The N=1 R-simplexes

Given a value of the Q tuple, \((Q_{\ell_1}, Q_{\ell_2}, \ldots, Q_{\ell_R}),\) we can ask what evaluations for a single test-taker \(i\) are consistent with the observed label responses of the test taker. These observable responses by the test taker are represented by variables of the form,

(2)\[\begin{equation} \{ R_{\ell_i} \}_{\ell_i \in \mathcal{L}} \end{equation}\]

What we seek are the values of the responses by true label that are consistent with these label responses by the test taker. These are denoted by variables of the form,

(3)\[\begin{equation} \{ R_{\ell_i,\ell_\text{true}} \}_{\ell_i \in \mathcal{L}}, \end{equation}\]

for each of the R labels. Given true label, all the unknown counts for a classifier must live on a simplex. For any \(\ell_{\text{true}}\) we must have the identity,

(4)\[\begin{equation} \sum_{\ell_i \in \mathcal{L}} R_{\ell_i,\ell_\text{true}} = Q_{\ell_\text{true}}. \end{equation}\]

The N=1 simplex for a single test taker is thus in a produc of spaces, one for each label,

(5)\[\begin{equation} (\mathbb{R}_{Q_{\ell_1}}^1) \otimes (\mathbb{R}_{Q_{\ell_2}}^1) \otimes \ldots (\mathbb{R}_{Q_{\ell_R}}^1). \end{equation}\]

The N=2 R-simplexes

The N=2 R-simplexes exist for all pairs of test-takers. These are now using the evidence in how a pair responded jointly to test questions. These are the observable response counts of the form \(R_{\ell_i, \ell_j}\). And the unknown label response statistics we seek are of the form,

(6)\[\begin{equation} \{ R_{\ell_i, \ell_j, \ell_\text{true}} \}_{\ell_i, \ell_j \in \mathcal{L}}, \end{equation}\]

These also lie on a simplex in a space that can be represented as,

(7)\[\begin{equation} (\mathbb{R}_{Q_{\ell_1}}^2) \otimes (\mathbb{R}_{Q_{\ell_2}}^2) \otimes \ldots (\mathbb{R}_{Q_{\ell_R}}^2). \end{equation}\]

The general N R-simplexes

In general, for N test-takers, we require a response space of the form,

(8)\[\begin{equation} (\mathbb{R}_{Q_{\ell_1}}^N) \otimes (\mathbb{R}_{Q_{\ell_2}}^N) \otimes \ldots (\mathbb{R}_{Q_{\ell_R}}^N). \end{equation}\]

The role of the axioms in defining the variety of logically consistent group evaluations

To have a logic one needs axioms. You have already encountered some of the them. The equalities that must be universally true for any classification test - responses by true label are on simplexes. So we can call these the simplex axioms. These are invariants that must be obeyed by all unknown statistics by true label.

These simplex axioms are linear relations that define the possible set of evaluations for a test of size \(Q\) at a given Q-simplex point. We won’t get into how you calculate these integer points but by linear algebra considerations we can see that while the responses simplexes are defined in an enclosing space of dimension \(R*R^{N}=R^{N+1}\), the simplexes are \(R*(R^N - 1)\) dimensional.

Note that ntqr uses the standard convention in mathematics of expressing these invariants without the equal sign,

(9)\[\begin{equation} \sum_{\ell_i \in \mathcal{L}} R_{\ell_i,\ell_\text{true}} - Q_{\ell_\text{true}}, \end{equation}\]

and an implicit \(= 0\) is understood on the right side. Part of the reason for this convention is that, while we may be interested in the points in label response space that make the expression equal to zero, they are embedded in the larger space where the equality is not satisfied. The solution set has a geometry in this larger space.

If we could visualize the possible evaluations they would be inside enclosing cubes of sides of size \(Q_\ell\) inside each label space. These are called varieties. A bit of formalism overkill in this linear space but it turns out polynomials are lurking in the shadows of this logic so it makes sense to start using it early.

The observable axioms and the logically consistent set

The possible evaluations for an ensemble of size \(N\) are greatly reduced once we have observed the test results. This is because we can now construct another set of universally applicable invariants – the observable axioms. Any observed count for a pattern of decisions by the ensemble must be equal to a sum over its partition over true label. For example, for two classifiers doing three label classication, every count of pair decisions must obey the equation,

(10)\[\begin{equation} R_{\ell_i, \ell_j} = \sum_{\ell_t \in \mathcal{L}} R_{\ell_i, \ell_j; \ell_t}. \end{equation}\]

Just like the simplex axioms reduced the dimensionality of the possible evaluations, the logically consistent set allows us to reduce the dimensionality of the space by \(R^N - 1\). In addition, the geometry of the consistent set is not straightforward as it depends on the test results. But the consistent set will be smaller than the possible one because now we cannot range over possible integers that go from 0 to \(Q_\ell\). We must range from 0 to \(\text{min}(Q_\ell, R_{\ldots}).\) Any decision event by true label must be equal to the min of the true label count in the answer key or the observed test responses.

# NTQR supports constructing the simplex and observable 
# axioms for any number of labels and classifiers. We'll
# use three labels (R=3) and three classifiers (N=3)
labels = ('a','b','c')
classifiers = ('i','j','k')
#
# Their simplex axioms
simplexAxioms = ntqr.raxioms.SimplexAxioms(labels, classifiers)
simplexAxioms
SimplexAxioms((('a', 'b', 'c'), ('i', 'j', 'k')))
# The .axioms property is a dictionary indexed by label for the 
# simplex axiom associated with it.
simplexAxioms.axioms['a']
\[\displaystyle Q_{a} - R_{a_{i},a_{j},a_{k},a} - R_{a_{i},a_{j},b_{k},a} - R_{a_{i},a_{j},c_{k},a} - R_{a_{i},b_{j},a_{k},a} - R_{a_{i},b_{j},b_{k},a} - R_{a_{i},b_{j},c_{k},a} - R_{a_{i},c_{j},a_{k},a} - R_{a_{i},c_{j},b_{k},a} - R_{a_{i},c_{j},c_{k},a} - R_{b_{i},a_{j},a_{k},a} - R_{b_{i},a_{j},b_{k},a} - R_{b_{i},a_{j},c_{k},a} - R_{b_{i},b_{j},a_{k},a} - R_{b_{i},b_{j},b_{k},a} - R_{b_{i},b_{j},c_{k},a} - R_{b_{i},c_{j},a_{k},a} - R_{b_{i},c_{j},b_{k},a} - R_{b_{i},c_{j},c_{k},a} - R_{c_{i},a_{j},a_{k},a} - R_{c_{i},a_{j},b_{k},a} - R_{c_{i},a_{j},c_{k},a} - R_{c_{i},b_{j},a_{k},a} - R_{c_{i},b_{j},b_{k},a} - R_{c_{i},b_{j},c_{k},a} - R_{c_{i},c_{j},a_{k},a} - R_{c_{i},c_{j},b_{k},a} - R_{c_{i},c_{j},c_{k},a}\]
# The observable axioms are more numerous, we have R^N of
# them, instead of R like the simplex axioms.
obsAxioms = ntqr.raxioms.ObservableAxioms(labels,classifiers)
Matrix(obsAxioms.axioms)
\[\begin{split}\displaystyle \left[\begin{matrix}R_{a_{i},a_{j},a_{k},a} + R_{a_{i},a_{j},a_{k},b} + R_{a_{i},a_{j},a_{k},c} - R_{a_{i},a_{j},a_{k}}\\R_{a_{i},a_{j},b_{k},a} + R_{a_{i},a_{j},b_{k},b} + R_{a_{i},a_{j},b_{k},c} - R_{a_{i},a_{j},b_{k}}\\R_{a_{i},a_{j},c_{k},a} + R_{a_{i},a_{j},c_{k},b} + R_{a_{i},a_{j},c_{k},c} - R_{a_{i},a_{j},c_{k}}\\R_{a_{i},b_{j},a_{k},a} + R_{a_{i},b_{j},a_{k},b} + R_{a_{i},b_{j},a_{k},c} - R_{a_{i},b_{j},a_{k}}\\R_{a_{i},b_{j},b_{k},a} + R_{a_{i},b_{j},b_{k},b} + R_{a_{i},b_{j},b_{k},c} - R_{a_{i},b_{j},b_{k}}\\R_{a_{i},b_{j},c_{k},a} + R_{a_{i},b_{j},c_{k},b} + R_{a_{i},b_{j},c_{k},c} - R_{a_{i},b_{j},c_{k}}\\R_{a_{i},c_{j},a_{k},a} + R_{a_{i},c_{j},a_{k},b} + R_{a_{i},c_{j},a_{k},c} - R_{a_{i},c_{j},a_{k}}\\R_{a_{i},c_{j},b_{k},a} + R_{a_{i},c_{j},b_{k},b} + R_{a_{i},c_{j},b_{k},c} - R_{a_{i},c_{j},b_{k}}\\R_{a_{i},c_{j},c_{k},a} + R_{a_{i},c_{j},c_{k},b} + R_{a_{i},c_{j},c_{k},c} - R_{a_{i},c_{j},c_{k}}\\R_{b_{i},a_{j},a_{k},a} + R_{b_{i},a_{j},a_{k},b} + R_{b_{i},a_{j},a_{k},c} - R_{b_{i},a_{j},a_{k}}\\R_{b_{i},a_{j},b_{k},a} + R_{b_{i},a_{j},b_{k},b} + R_{b_{i},a_{j},b_{k},c} - R_{b_{i},a_{j},b_{k}}\\R_{b_{i},a_{j},c_{k},a} + R_{b_{i},a_{j},c_{k},b} + R_{b_{i},a_{j},c_{k},c} - R_{b_{i},a_{j},c_{k}}\\R_{b_{i},b_{j},a_{k},a} + R_{b_{i},b_{j},a_{k},b} + R_{b_{i},b_{j},a_{k},c} - R_{b_{i},b_{j},a_{k}}\\R_{b_{i},b_{j},b_{k},a} + R_{b_{i},b_{j},b_{k},b} + R_{b_{i},b_{j},b_{k},c} - R_{b_{i},b_{j},b_{k}}\\R_{b_{i},b_{j},c_{k},a} + R_{b_{i},b_{j},c_{k},b} + R_{b_{i},b_{j},c_{k},c} - R_{b_{i},b_{j},c_{k}}\\R_{b_{i},c_{j},a_{k},a} + R_{b_{i},c_{j},a_{k},b} + R_{b_{i},c_{j},a_{k},c} - R_{b_{i},c_{j},a_{k}}\\R_{b_{i},c_{j},b_{k},a} + R_{b_{i},c_{j},b_{k},b} + R_{b_{i},c_{j},b_{k},c} - R_{b_{i},c_{j},b_{k}}\\R_{b_{i},c_{j},c_{k},a} + R_{b_{i},c_{j},c_{k},b} + R_{b_{i},c_{j},c_{k},c} - R_{b_{i},c_{j},c_{k}}\\R_{c_{i},a_{j},a_{k},a} + R_{c_{i},a_{j},a_{k},b} + R_{c_{i},a_{j},a_{k},c} - R_{c_{i},a_{j},a_{k}}\\R_{c_{i},a_{j},b_{k},a} + R_{c_{i},a_{j},b_{k},b} + R_{c_{i},a_{j},b_{k},c} - R_{c_{i},a_{j},b_{k}}\\R_{c_{i},a_{j},c_{k},a} + R_{c_{i},a_{j},c_{k},b} + R_{c_{i},a_{j},c_{k},c} - R_{c_{i},a_{j},c_{k}}\\R_{c_{i},b_{j},a_{k},a} + R_{c_{i},b_{j},a_{k},b} + R_{c_{i},b_{j},a_{k},c} - R_{c_{i},b_{j},a_{k}}\\R_{c_{i},b_{j},b_{k},a} + R_{c_{i},b_{j},b_{k},b} + R_{c_{i},b_{j},b_{k},c} - R_{c_{i},b_{j},b_{k}}\\R_{c_{i},b_{j},c_{k},a} + R_{c_{i},b_{j},c_{k},b} + R_{c_{i},b_{j},c_{k},c} - R_{c_{i},b_{j},c_{k}}\\R_{c_{i},c_{j},a_{k},a} + R_{c_{i},c_{j},a_{k},b} + R_{c_{i},c_{j},a_{k},c} - R_{c_{i},c_{j},a_{k}}\\R_{c_{i},c_{j},b_{k},a} + R_{c_{i},c_{j},b_{k},b} + R_{c_{i},c_{j},b_{k},c} - R_{c_{i},c_{j},b_{k}}\\R_{c_{i},c_{j},c_{k},a} + R_{c_{i},c_{j},c_{k},b} + R_{c_{i},c_{j},c_{k},c} - R_{c_{i},c_{j},c_{k}}\end{matrix}\right]\end{split}\]

We can see from these equations that the dimensionality of the logically consistent evaluations is going to be smaller than the one for the possible set. But, in addition, these relations will make the consistent set have an “irregular” appearance since now, across labels, decision events are capped at their observed count. This can result in dimensional collapse for tests that are smaller than the size required to obtain a count for every possible event.

Let’s see that for four classifiers doing 5 label decisions. This would apply to, say, LLMs taking a multiple-choice exam with 5 choices per question.

# We create synthetic test results for a Q=300 test.
labels = ('a','b','c','d','e')
answer_key = [random.choice(labels) for i in range(300)]
counts = collections.Counter(answer_key) # We expect about 300/5=60 per label
counts
Counter({'c': 66, 'a': 62, 'e': 62, 'd': 57, 'b': 53})
counts.total()
300
# We assume they have a simple, independent error model
prob_correct = [random.uniform(0.4,0.9) for classifier in classifiers]
print(prob_correct)
decisions = [[labels[i] if prob_correct[i] > random.random() else random.choice([label for label in labels if label != answer])
              for i in range(len(classifiers))] 
             for answer in answer_key]
counts = collections.Counter([tuple(decision) for decision in decisions])
len(counts) # 5^3 = 125 would be a full count
[0.5700890062495862, 0.7382714302763238, 0.7908421837749215]
42
# But even for most dimensions, the size
# of the allowed integers is 2 (for count 1).
counts
Counter({('a', 'b', 'c'): 127,
 ('d', 'b', 'c'): 23,
 ('c', 'b', 'c'): 21,
 ('e', 'b', 'c'): 17,
 ('b', 'b', 'c'): 14,
 ('a', 'd', 'c'): 9,
 ('a', 'c', 'c'): 8,
 ('a', 'b', 'd'): 8,
 ('a', 'b', 'e'): 8,
 ('a', 'e', 'c'): 7,
 ('a', 'a', 'c'): 6,
 ('a', 'b', 'a'): 5,
 ('a', 'b', 'b'): 4,
 ('d', 'e', 'c'): 3,
 ('a', 'd', 'd'): 3,
 ('d', 'a', 'c'): 2,
 ('e', 'd', 'c'): 2,
 ('d', 'd', 'c'): 2,
 ('e', 'e', 'c'): 2,
 ('c', 'e', 'c'): 2,
 ('e', 'b', 'b'): 2,
 ('b', 'd', 'c'): 2,
 ('c', 'd', 'c'): 2,
 ('e', 'a', 'a'): 2,
 ('b', 'c', 'c'): 2,
 ('d', 'b', 'b'): 1,
 ('b', 'b', 'a'): 1,
 ('d', 'c', 'c'): 1,
 ('d', 'b', 'd'): 1,
 ('e', 'b', 'a'): 1,
 ('a', 'c', 'd'): 1,
 ('c', 'b', 'e'): 1,
 ('a', 'a', 'd'): 1,
 ('a', 'e', 'a'): 1,
 ('e', 'e', 'a'): 1,
 ('a', 'd', 'a'): 1,
 ('c', 'b', 'a'): 1,
 ('c', 'c', 'c'): 1,
 ('c', 'd', 'b'): 1,
 ('d', 'b', 'e'): 1,
 ('c', 'a', 'c'): 1,
 ('c', 'c', 'd'): 1})

Conclusion

This notebook should have given you an introduction to the integer label response spaces that are used in the logic as well as their geometry. Readers familiar with linear algebra and algebraic geometry may think it is overkill to invoke the terminology of the latter by refering to ideals and varieties. What is presented here is certainly understandable with the language of linear algebra alone. But polynomial, integer equations are also part of this logic. For example, if we ask for group evaluations where the classifiers are pair-independent in their errors, we have consider quadratic equations. For three-way independent classifers, cubic polynomials need to be solved.