The Possible and Consistent Evaluation Sets

The principal “geometric” objects in any logic of unsupervised evaluation are the possible and consistent sets of evaluations. The possible set contains every evaluation that a test could have. For classification and multiple-choice tests with a fixed number of responses, these sets are finite and we can compute their size. The consistent set is a subset, sometimes a very small one, of the possible set: it holds the evaluations that are logically consistent with the counts of observed decisions by an ensemble of test takers.

This notebook demonstrates the new classes in v0.8 of NTQR that allow you to generate and sample these sets for any number of classes and classifiers.

import itertools, random, collections
import numpy as np  # used below to display sampled points as arrays
import sympy

import ntqr
import ntqr.raxioms
import ntqr.statistics
import ntqr.evaluations

%%capture
# Some ugly duct taping to make Jupyter notebooks, and the HTML documentation
# pages derived from them, look decent.
from IPython.display import display,Math,Latex, HTML

sympy.init_session(quiet=True, auto_symbols=False);
sympy.init_printing(print_builtin=False, use_latex='mathjax', latex_mode='plain', order='old')

# We also need some duct taping for showing symbolic solutions to equations
def equations_to_flalign(eqs):
    lines = []
    for eq in eqs:
        lines.append(f"& {sympy.latex(eq)} &")

    body = r" \\\\ ".join(lines)

    latex_str = r"\begin{flalign*}" + body + r"\end{flalign*}"

    return latex_str

The space of sample test statistics

The possible and consistent sets are geometric objects in the space of variables that denote the unknown sample statistics of a test just taken by \(N\) classifiers. NTQR’s code is “general” in the sense that you state what the labels are and their sorting order. The same goes for the classifiers: your list defines their sorting order. The “R” in NTQR stands for the number of labels, or responses, that test takers can give when answering a question in the test.

NTQR uses the SymPy package to do symbolic computations under the hood. There are \(R^{N+1}\) of these variables (\(R^N\) joint responses for each of the \(R\) true labels), and the ntqr.statistics.ResponseVariables class constructs them.

labels = ('a', 'b', 'c')
classifiers = ('i', 'j', 'k')
# The class ntqr.statistics.ResponseVariables gives
# us access to the label response variables
rVars = ntqr.statistics.ResponseVariables(labels, classifiers)
# We can find them in the .label_responses property of the class,
# a dict indexed by true label
rVars.label_responses['b']
{('a', 'a', 'a'): R_{a_{i},a_{j},a_{k},b},
 ('a', 'a', 'b'): R_{a_{i},a_{j},b_{k},b},
 ('a', 'a', 'c'): R_{a_{i},a_{j},c_{k},b},
 ('a', 'b', 'a'): R_{a_{i},b_{j},a_{k},b},
 ('a', 'b', 'b'): R_{a_{i},b_{j},b_{k},b},
 ('a', 'b', 'c'): R_{a_{i},b_{j},c_{k},b},
 ('a', 'c', 'a'): R_{a_{i},c_{j},a_{k},b},
 ('a', 'c', 'b'): R_{a_{i},c_{j},b_{k},b},
 ('a', 'c', 'c'): R_{a_{i},c_{j},c_{k},b},
 ('b', 'a', 'a'): R_{b_{i},a_{j},a_{k},b},
 ('b', 'a', 'b'): R_{b_{i},a_{j},b_{k},b},
 ('b', 'a', 'c'): R_{b_{i},a_{j},c_{k},b},
 ('b', 'b', 'a'): R_{b_{i},b_{j},a_{k},b},
 ('b', 'b', 'b'): R_{b_{i},b_{j},b_{k},b},
 ('b', 'b', 'c'): R_{b_{i},b_{j},c_{k},b},
 ('b', 'c', 'a'): R_{b_{i},c_{j},a_{k},b},
 ('b', 'c', 'b'): R_{b_{i},c_{j},b_{k},b},
 ('b', 'c', 'c'): R_{b_{i},c_{j},c_{k},b},
 ('c', 'a', 'a'): R_{c_{i},a_{j},a_{k},b},
 ('c', 'a', 'b'): R_{c_{i},a_{j},b_{k},b},
 ('c', 'a', 'c'): R_{c_{i},a_{j},c_{k},b},
 ('c', 'b', 'a'): R_{c_{i},b_{j},a_{k},b},
 ('c', 'b', 'b'): R_{c_{i},b_{j},b_{k},b},
 ('c', 'b', 'c'): R_{c_{i},b_{j},c_{k},b},
 ('c', 'c', 'a'): R_{c_{i},c_{j},a_{k},b},
 ('c', 'c', 'b'): R_{c_{i},c_{j},b_{k},b},
 ('c', 'c', 'c'): R_{c_{i},c_{j},c_{k},b}}
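
As a quick sanity check on the variable count, the keys of this dictionary can be reproduced with the standard library alone. The sketch below does not call NTQR; it assumes the keys follow `itertools.product` order over the classifiers, which matches the listing above, and the names `joint_responses` and `variable_keys` are purely illustrative.

```python
import itertools

labels = ('a', 'b', 'c')
classifiers = ('i', 'j', 'k')

# Every joint response is one label per classifier: R^N tuples.
joint_responses = list(itertools.product(labels, repeat=len(classifiers)))

# One response variable per (joint response, true label) pair: R^(N+1).
variable_keys = [(response, true_label)
                 for true_label in labels
                 for response in joint_responses]

print(len(joint_responses))  # 27 joint responses per true label
print(len(variable_keys))    # 81 response variables in total
```

With three labels and three classifiers this gives \(3^3 = 27\) entries per true label and \(3^4 = 81\) variables overall.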

The possible set

An evaluation for \(N\) classifiers is the count of each of the \(R^N\) joint response events given the true label. For a test of size \(Q\), the possible evaluations are parametrized by points in the answer key simplex, that is, by the count of each label in any assumed answer key. Let’s take a look at the class in ntqr.statistics that constructs the variables that specify a point in the Q-simplex.

qVars = ntqr.statistics.AnswerKeyVariables(labels)
qVars.qs
mappingproxy({'a': Qₐ, 'b': Q_b, 'c': Q_c})
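
To make the Q-simplex concrete, here is a minimal sketch that enumerates its integer points for a small test using only the standard library. The helper `q_simplex_points` is a hypothetical name introduced for illustration and is not part of NTQR.

```python
import math

def q_simplex_points(test_size, num_labels):
    """Yield every tuple of non-negative label counts summing to test_size."""
    if num_labels == 1:
        yield (test_size,)
        return
    for first in range(test_size + 1):
        for rest in q_simplex_points(test_size - first, num_labels - 1):
            yield (first,) + rest

points = list(q_simplex_points(20, 3))

# Stars and bars: there are C(Q + R - 1, R - 1) points in the Q-simplex.
assert len(points) == math.comb(20 + 3 - 1, 3 - 1)
print(len(points))          # 231
print((6, 8, 6) in points)  # True
```

Each point is one assumed split of the \(Q\) questions among the labels, so a test with \(Q=20\) and three labels has 231 candidate answer key counts.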

The count of possible states for even a small test can be quite large. We are going to look at a \(Q=20\) test with our running example of three labels and three classifiers, and we will pick a “center” point in the Q-simplex: (6,8,6).

ql = (6,8,6)
pSet = ntqr.evaluations.PossibleSet(labels,classifiers)
pSet.set_count_at_ql(ql)
\[\displaystyle 14909583151850720256\]

This is \(1.5 \times 10^{19}\), a huge number. NTQR includes a set generator that will try to produce all of these evaluations. Conveniently, we can also sample from this set randomly.

[np.array(point) for point in pSet.random_points(ql,3)]
[array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0],
        [0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 1, 0]]),
 array([[0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 1, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
         0, 1, 0, 1, 0]]),
 array([[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0,
         0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0]])]
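
The size reported for the point (6, 8, 6) can be cross-checked by hand. Assuming an evaluation is one row of \(R^N\) non-negative counts per true label, with each row summing to that label’s \(Q_\ell\), a stars-and-bars count over each row reproduces the number above. The sketch below also draws a uniform random point by sampling bar positions. All function names are hypothetical, and this is not necessarily how NTQR counts or samples internally.

```python
import math
import random

def possible_set_size(ql, num_labels, num_classifiers):
    """Count evaluations at a Q-simplex point by stars and bars."""
    num_patterns = num_labels ** num_classifiers  # R^N joint responses
    size = 1
    for q in ql:
        # Ways to split q questions among the R^N joint responses.
        size *= math.comb(q + num_patterns - 1, num_patterns - 1)
    return size

def random_composition(total, parts, rng=random):
    """Uniformly sample `parts` non-negative integers summing to `total`."""
    # Pick bar positions among total + parts - 1 slots; the gaps are the counts.
    bars = sorted(rng.sample(range(total + parts - 1), parts - 1))
    edges = [-1] + bars + [total + parts - 1]
    return [edges[i + 1] - edges[i] - 1 for i in range(parts)]

def random_possible_point(ql, num_labels, num_classifiers, rng=random):
    """One row of R^N counts per true label, each row summing to its Q_l."""
    num_patterns = num_labels ** num_classifiers
    return [random_composition(q, num_patterns, rng) for q in ql]

ql = (6, 8, 6)
print(possible_set_size(ql, 3, 3))  # 14909583151850720256
point = random_possible_point(ql, 3, 3)
print([sum(row) for row in point])  # [6, 8, 6]
```

The rows are sampled independently, so the draw is uniform over the whole set under the stated assumption about its structure.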

The consistent set

We have chosen to parametrize the possible sets by points in the answer key simplex, or Q-simplex. The inputs to all the computations of the possible set are just the labels, the classifier names, and a point in the Q-simplex. To compute the consistent set, we need just one additional input: the counts of agreements and disagreements between the classifiers. These are the observed counts of the \(R^N\) ways they could agree and disagree.

For the purposes of illustrating the software, we create a synthetic set of counts by independently sampling from three noisy classifiers. Note that this has no implications for the universality of this code: it applies to any set of classifiers, error independent or not. This is clearly demonstrated by the fact that the logic uses only these counts, and nothing else, to compute the consistent set.

There are no tuning parameters in NTQR; all its computations are based on the counts alone. This is what makes it a logic universally applicable to all classification tests, in any domain.

# We start with a small test so we can compute the consistent set in a
# reasonable amount of time.
answer_key = [random.choice(labels) for i in range(20)]
label_accuracies = [{label:random.uniform(0.45,0.9) for label in labels} for classifier in classifiers]
test_results = [tuple(tlabel if label_accuracies[j][tlabel]>random.random()  else random.choice([ol for ol in labels if ol != tlabel]) 
                 for j in range(len(classifiers)) )
                for tlabel in answer_key]
counts=collections.Counter(test_results)
print(counts)
Counter({('c', 'c', 'c'): 4, ('b', 'b', 'b'): 3, ('a', 'a', 'a'): 2, ('a', 'c', 'c'): 2, ('b', 'b', 'a'): 2, ('a', 'c', 'b'): 1, ('c', 'a', 'c'): 1, ('a', 'b', 'b'): 1, ('b', 'a', 'a'): 1, ('c', 'b', 'b'): 1, ('a', 'a', 'c'): 1, ('c', 'b', 'c'): 1})

These observed counts show a pattern that helps make computation of the consistent set much faster than that of the possible one: most of the \(R^N\) possible joint responses are either never observed or have singleton counts.

Let’s time how long it takes to compute the consistent set. We pick a “center” point in the Q-simplex, (6, 8, 6), for our simulated test of size \(Q=20.\)

import time
cSet = ntqr.evaluations.ConsistentSet(labels,classifiers,counts)
start = time.perf_counter()
print(sum(1 for point in cSet.set_generator((6,8,6))))
end = time.perf_counter()

print(f"Elapsed time: {end - start:.4f} seconds")
# Let us remind ourselves of the size of the possible set
# at this Q-simplex point.
pSet = ntqr.evaluations.PossibleSet(labels, classifiers)
pSet.set_count_at_ql((6,8,6))
\[\displaystyle 14909583151850720256\]

Just like the possible set, we can generate random points in the consistent set.

[np.array(point) for point in cSet.random_points(ql,3)]
[array([[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0,
         1, 1, 0, 0, 1],
        [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 2]], dtype=object),
 array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
         1, 0, 0, 0, 3],
        [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 1],
        [1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]], dtype=object),
 array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1],
        [1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 3],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
         0, 1, 0, 0, 0]], dtype=object)]
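
One way to read “consistent” is as a pair of sum constraints: each true-label row must total its \(Q_\ell\), and summing the rows over true labels must reproduce every observed count. The sketch below checks this for the first sampled point above, written sparsely. It assumes the array rows follow the label order and the columns follow `itertools.product` order over the classifiers; `is_consistent` is a hypothetical helper, not an NTQR function.

```python
import collections

labels = ('a', 'b', 'c')
ql = (6, 8, 6)

# Observed joint-response counts printed earlier in this notebook.
observed = collections.Counter({
    ('c', 'c', 'c'): 4, ('b', 'b', 'b'): 3, ('a', 'a', 'a'): 2,
    ('a', 'c', 'c'): 2, ('b', 'b', 'a'): 2, ('a', 'c', 'b'): 1,
    ('c', 'a', 'c'): 1, ('a', 'b', 'b'): 1, ('b', 'a', 'a'): 1,
    ('c', 'b', 'b'): 1, ('a', 'a', 'c'): 1, ('c', 'b', 'c'): 1,
})

# First sampled point above as {true label: {joint response: count}}.
# The row and column layout is an assumption, not documented NTQR behavior.
evaluation = {
    'a': {('a', 'b', 'b'): 1, ('a', 'c', 'c'): 1,
          ('b', 'b', 'b'): 3, ('c', 'c', 'c'): 1},
    'b': {('a', 'a', 'a'): 1, ('a', 'c', 'c'): 1, ('b', 'b', 'a'): 2,
          ('c', 'a', 'c'): 1, ('c', 'b', 'b'): 1, ('c', 'b', 'c'): 1,
          ('c', 'c', 'c'): 1},
    'c': {('a', 'a', 'a'): 1, ('a', 'a', 'c'): 1, ('a', 'c', 'b'): 1,
          ('b', 'a', 'a'): 1, ('c', 'c', 'c'): 2},
}

def is_consistent(evaluation, observed, labels, ql):
    """Check the row-sum and column-sum constraints described above."""
    # Each true label must account for exactly Q_l questions.
    for label, q in zip(labels, ql):
        if sum(evaluation[label].values()) != q:
            return False
    # Summing over true labels must reproduce every observed count.
    implied = collections.Counter()
    for label in labels:
        implied.update(evaluation[label])
    return implied == observed

print(is_consistent(evaluation, observed, labels, ql))  # True
```

A random point from the possible set will almost never pass this check, which is why the consistent set can be so much smaller.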

The correct cuboid generator and random sampler

The evaluations in the possible and consistent sets are unique: the process that generates them never creates duplicates. This is not the case when we marginalize the joint evaluations of all the classifiers. One particular object of interest is the correct cuboid, the marginalization of the classifiers’ joint responses to their individual correct counts. There is one such cuboid for each label.

NTQR also has a generator and random sampler for the label correct cuboids from the consistent set. The points they produce are no longer unique, since many evaluations marginalize to the same individual counts.

[np.array(correct_point) for correct_point in cSet.random_correct_cuboid_points(ql,3)]
[array([[2, 1, 1],
        [3, 4, 3],
        [0, 3, 2]], dtype=object),
 array([[2, 2, 1],
        [2, 2, 1],
        [1, 3, 2]], dtype=object),
 array([[3, 3, 3],
        [1, 2, 2],
        [2, 1, 2]], dtype=object)]
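
The marginalization itself is simple to sketch. Assuming each row of the arrays above belongs to one true label and each column to one classifier, the entry is the number of questions with that true label on which the classifier answered correctly. The example below applies this to one joint evaluation from the consistent set, written sparsely as {true label: {joint response: count}}. The helper `correct_counts` and this layout are assumptions made for illustration, not NTQR’s API.

```python
labels = ('a', 'b', 'c')
classifiers = ('i', 'j', 'k')

# A joint evaluation consistent with the observed counts in this notebook.
evaluation = {
    'a': {('a', 'b', 'b'): 1, ('a', 'c', 'c'): 1,
          ('b', 'b', 'b'): 3, ('c', 'c', 'c'): 1},
    'b': {('a', 'a', 'a'): 1, ('a', 'c', 'c'): 1, ('b', 'b', 'a'): 2,
          ('c', 'a', 'c'): 1, ('c', 'b', 'b'): 1, ('c', 'b', 'c'): 1,
          ('c', 'c', 'c'): 1},
    'c': {('a', 'a', 'a'): 1, ('a', 'a', 'c'): 1, ('a', 'c', 'b'): 1,
          ('b', 'a', 'a'): 1, ('c', 'c', 'c'): 2},
}

def correct_counts(evaluation, labels, classifiers):
    """For each true label, count how often each classifier answered it."""
    cuboid = []
    for label in labels:
        row = [0] * len(classifiers)
        for response, count in evaluation[label].items():
            for position, answer in enumerate(response):
                if answer == label:
                    row[position] += count
        cuboid.append(row)
    return cuboid

print(correct_counts(evaluation, labels, classifiers))
# [[2, 0, 0], [2, 4, 1], [2, 3, 3]]
```

Because the marginalization discards how the classifiers’ errors line up with one another, different joint evaluations can collapse to the same 3 by 3 array.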