The Possible and Consistent Evaluation Sets

The principal “geometric” objects in any logic of unsupervised evaluation are the possible and consistent sets of evaluations. The possible set contains every evaluation that a test could have. For classification and multiple-choice tests with a fixed number of responses, these sets are finite and we can compute their size. The consistent set is a subset, sometimes a very small one, of the possible set: it contains the evaluations that are logically consistent with the observed counts of decisions made by an ensemble of test takers.

This notebook demonstrates the new classes in v0.8 of NTQR that allow you to generate and sample these sets for any number of classes and classifiers.

import itertools, random, collections
import numpy as np
import sympy

import ntqr
import ntqr.raxioms
import ntqr.evaluations
import ntqr.statistics

The space of sample test statistics

The possible and consistent sets are geometric objects in the space of variables that denote the unknown sample statistics of a test just taken by \(N\) classifiers. NTQR’s code is “general” in the sense that you state what the labels are and their sorting order. The “R” in NTQR stands for the number of labels, or responses, that test takers can give when answering a question in the test. The same goes for the classifiers: the list you supply defines their sorting order.

NTQR uses the SymPy package to do symbolic computations under the hood. There are \(R^{N+1}\) of these variables, one for each of the \(R^N\) joint decision events given each of the \(R\) true labels, and the ntqr.statistics.ResponseVariables class constructs them.

labels = ('a', 'b', 'c')
classifiers = (1,2,3,4)
# The class ntqr.statistics.ResponseVariables gives
# us access to the label response variables
rVars = ntqr.statistics.ResponseVariables(labels, classifiers)
# We can find them in the .label_responses property of the class as a dict
# indexed by true label, and then by joint decision event
list(itertools.islice(rVars.label_responses['b'].items(),10))

[(('a', 'a', 'a', 'a'), R_{a_{1},a_{2},a_{3},a_{4},b}),
 (('a', 'a', 'a', 'b'), R_{a_{1},a_{2},a_{3},b_{4},b}),
 (('a', 'a', 'a', 'c'), R_{a_{1},a_{2},a_{3},c_{4},b}),
 (('a', 'a', 'b', 'a'), R_{a_{1},a_{2},b_{3},a_{4},b}),
 (('a', 'a', 'b', 'b'), R_{a_{1},a_{2},b_{3},b_{4},b}),
 (('a', 'a', 'b', 'c'), R_{a_{1},a_{2},b_{3},c_{4},b}),
 (('a', 'a', 'c', 'a'), R_{a_{1},a_{2},c_{3},a_{4},b}),
 (('a', 'a', 'c', 'b'), R_{a_{1},a_{2},c_{3},b_{4},b}),
 (('a', 'a', 'c', 'c'), R_{a_{1},a_{2},c_{3},c_{4},b}),
 (('a', 'b', 'a', 'a'), R_{a_{1},b_{2},a_{3},a_{4},b})]
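
The bookkeeping behind these variables can be sketched without NTQR. The snippet below is a minimal illustration that uses only itertools; the names `events` and `variable_index` are hypothetical and are not part of the NTQR API. It enumerates the \(R^N\) joint decision events in the same sorted order as the keys shown above and confirms that pairing each event with a true label gives \(R^{N+1}\) variables.

```python
import itertools

labels = ('a', 'b', 'c')
classifiers = (1, 2, 3, 4)

# Every joint decision event is one response per classifier,
# enumerated in the sorting order given by `labels`.
events = list(itertools.product(labels, repeat=len(classifiers)))

# One unknown count per (true label, event) pair.
# `variable_index` is a hypothetical name used only for this illustration.
variable_index = [(true_label, event) for true_label in labels for event in events]

print(len(events))          # R^N joint decision events
print(len(variable_index))  # R^(N+1) response variables
print(events[:3])
```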

The possible set

An evaluation for \(N\) classifiers is the count of each of the \(R^N\) events given each true label. For a test of size \(Q\), the possible evaluations are parametrized by points in the answer key simplex – the count of each label in any assumed answer key. Let’s take a look at the class in ntqr.statistics that constructs the variables that specify a point in the Q-simplex.

qVars = ntqr.statistics.AnswerKeyVariables(labels)
qVars.qs

mappingproxy({'a': Qₐ, 'b': Q_b, 'c': Q_c})
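
To make the Q-simplex concrete, here is a small sketch that lists its integer points for a given test size. The generator `q_simplex_points` is a hypothetical helper written for this illustration, not an NTQR function. For \(Q=10\) and \(R=3\) it should agree with the standard stars-and-bars count \(\binom{Q+R-1}{R-1}\).

```python
from math import comb

def q_simplex_points(Q, R):
    """Yield every tuple of R non-negative integers that sums to Q."""
    if R == 1:
        yield (Q,)
        return
    for first in range(Q + 1):
        for rest in q_simplex_points(Q - first, R - 1):
            yield (first,) + rest

points = list(q_simplex_points(10, 3))
print(len(points), comb(10 + 3 - 1, 3 - 1))  # enumerated count vs. closed form
print((3, 4, 3) in points)                   # the "center" point used in this notebook
```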

The count of possible states for even a small test can be quite large. We are going to look at a \(Q=10\) test with our running example of three labels and four classifiers, and we will pick a “center” point in the Q-simplex: (3,4,3).

ql = (3,4,3)
pSet = ntqr.evaluations.PossibleSet(labels,classifiers)
pSet.set_count_at_ql(ql)

\[\displaystyle 16289075433767661\]
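
The count shown above can be reproduced by hand with a stars-and-bars argument. This is a worked check, assuming that an evaluation distributes the \(Q_\ell\) questions with true label \(\ell\) over the \(R^N = 81\) joint decision events, independently for each label; the source does not state how NTQR computes the number internally.

\[\prod_{\ell \in \{a,b,c\}} \binom{Q_\ell + R^N - 1}{R^N - 1} = \binom{83}{80}\binom{84}{80}\binom{83}{80} = 91881 \cdot 1929501 \cdot 91881 = 16289075433767661\]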

This is about \(1.6 \times 10^{16}\), a huge number. NTQR includes a set generator that will try to produce all of these evaluations. Conveniently, we can also sample from this set randomly.

[np.array(point) for point in pSet.random_points(ql,2)]

[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])]
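
One way to draw such a point, sketched here with only the standard library, is to sample for each label a random way of distributing its \(Q_\ell\) questions over the \(R^N=81\) events. The helper names `random_composition` and `random_possible_point` are hypothetical, and the uniform stars-and-bars sampling is an assumption; the source does not say how pSet.random_points draws its samples.

```python
import random

def random_composition(total, parts, rng):
    """Uniformly sample `parts` non-negative integers that sum to `total`."""
    # Stars and bars: pick positions for parts-1 bars among total+parts-1 slots.
    bars = sorted(rng.sample(range(total + parts - 1), parts - 1))
    counts, previous = [], -1
    for bar in bars:
        counts.append(bar - previous - 1)        # stars between consecutive bars
        previous = bar
    counts.append(total + parts - 2 - previous)  # stars after the last bar
    return counts

def random_possible_point(ql, n_events, rng):
    """One row of event counts per label; row i sums to ql[i]."""
    return [random_composition(q, n_events, rng) for q in ql]

rng = random.Random(0)
point = random_possible_point((3, 4, 3), 3 ** 4, rng)
print([sum(row) for row in point])  # [3, 4, 3]
```

As in the arrays shown above, each of the three rows has 81 entries and sums to the corresponding entry of the Q-simplex point.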

The consistent set

We have chosen to parametrize the possible sets by points in the answer key simplex, or Q-simplex. The inputs to all the computations of the possible set are just the labels, the classifier names, and a point in the Q-simplex. To compute the consistent set, we need just one more input – the counts of agreements and disagreements between the classifiers. These are the observed counts of the \(R^N\) ways they could agree and disagree.

For the purposes of illustrating the software, we create a synthetic set of counts by independently sampling from four noisy classifiers. Note that this has no implications for the universality of this code – it applies to any set of classifiers, error independent or not. This is clearly demonstrated by the fact that the logic uses only these counts, and nothing else, to compute the consistent set.

There are no tuning parameters in NTQR; all its computations are based on the counts alone. This is what makes it a logic universally applicable to all classification tests, in any domain.

# We start with a small test so we can compute the possible set in a
# reasonable amount of time. For three labels, R=3, Q=10 will work.
Q=10
answer_key = [random.choice(labels) for i in range(Q)]
label_accuracies = [{label:random.uniform(0.45,0.9) for label in labels} for classifier in classifiers]
test_results = [tuple(tlabel if label_accuracies[j][tlabel]>random.random()  
                             else random.choice([ol for ol in labels if ol != tlabel]) 
                             for j in range(len(classifiers)) )
                             for tlabel in answer_key]
counts=collections.Counter(test_results)
print(len(counts))
counts

Counter({('a', 'a', 'a', 'a'): 2,
 ('a', 'c', 'a', 'c'): 1,
 ('b', 'c', 'a', 'a'): 1,
 ('b', 'c', 'a', 'c'): 1,
 ('b', 'c', 'a', 'b'): 1,
 ('b', 'b', 'b', 'a'): 1,
 ('c', 'b', 'b', 'b'): 1,
 ('c', 'c', 'c', 'c'): 1,
 ('a', 'b', 'a', 'b'): 1})

These observed counts demonstrate a pattern that helps make computing the consistent set much faster than computing the possible one – most of the \(R^N\) possible joint decision events are either not observed or have singleton counts. Here, \(R^N=3^4=81\). A test of size \(Q=10\) can show at most 10 distinct events, and this one showed 9 of the 81 possible joint decision events.
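
The consistency condition can be stated concretely. The sketch below uses the observed counts shown above and a hypothetical helper, `is_consistent`, which is not NTQR's API. It assumes an evaluation assigns a count of each joint event to each true label, and treats the evaluation as consistent when the per-label totals match the chosen Q-simplex point and, for every event, the counts summed over true labels equal the observed count.

```python
from collections import Counter

labels = ('a', 'b', 'c')
observed = Counter({
    ('a', 'a', 'a', 'a'): 2, ('a', 'c', 'a', 'c'): 1, ('b', 'c', 'a', 'a'): 1,
    ('b', 'c', 'a', 'c'): 1, ('b', 'c', 'a', 'b'): 1, ('b', 'b', 'b', 'a'): 1,
    ('c', 'b', 'b', 'b'): 1, ('c', 'c', 'c', 'c'): 1, ('a', 'b', 'a', 'b'): 1,
})

def is_consistent(evaluation, labels, observed, ql):
    """evaluation: dict mapping each true label to a Counter of joint events."""
    # Per-label totals must reproduce the assumed answer key counts.
    if any(sum(evaluation[l].values()) != q for l, q in zip(labels, ql)):
        return False
    # Summed over true labels, each event count must equal what was observed.
    summed = Counter()
    for by_event in evaluation.values():
        summed.update(by_event)
    return all(summed[e] == observed[e] for e in set(summed) | set(observed))

# A hand-built candidate at the Q-simplex point (3, 4, 3).
candidate = {
    'a': Counter({('a', 'a', 'a', 'a'): 2, ('a', 'c', 'a', 'c'): 1}),
    'b': Counter({('b', 'c', 'a', 'a'): 1, ('b', 'c', 'a', 'c'): 1,
                  ('b', 'c', 'a', 'b'): 1, ('b', 'b', 'b', 'a'): 1}),
    'c': Counter({('c', 'b', 'b', 'b'): 1, ('c', 'c', 'c', 'c'): 1,
                  ('a', 'b', 'a', 'b'): 1}),
}
print(is_consistent(candidate, labels, observed, (3, 4, 3)))  # True
```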

Let’s time how long it takes to compute the consistent set. We pick the same “center” point in the Q-simplex, (3,4,3), for our simulated test of size \(Q=10\).

import time
cSet = ntqr.evaluations.ConsistentSet(labels,classifiers,counts)
ql=(3,4,3)
start = time.perf_counter()
print(sum(1 for point in cSet.set_generator(ql)))
end = time.perf_counter()

print(f"Elapsed time: {end - start:.4f} seconds")

# Let us remind ourselves of the size of the possible set
# at this Q-simplex point.
pSet = ntqr.evaluations.PossibleSet(labels, classifiers)
pSet.set_count_at_ql(ql)

\[\displaystyle 16289075433767661\]
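
To see why the consistent set is so much smaller, its size can be counted directly from the observed counts. The sketch below is a plain-Python illustration with hypothetical helpers `compositions` and `consistent_set_size`; it is not how NTQR is implemented, and it assumes the same consistency definition as above: every observed event count is split among the true labels so that the label totals equal the Q-simplex point.

```python
from collections import Counter

observed = Counter({
    ('a', 'a', 'a', 'a'): 2, ('a', 'c', 'a', 'c'): 1, ('b', 'c', 'a', 'a'): 1,
    ('b', 'c', 'a', 'c'): 1, ('b', 'c', 'a', 'b'): 1, ('b', 'b', 'b', 'a'): 1,
    ('c', 'b', 'b', 'b'): 1, ('c', 'c', 'c', 'c'): 1, ('a', 'b', 'a', 'b'): 1,
})

def compositions(total, parts):
    """Yield every way to write `total` as an ordered sum of `parts` counts."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def consistent_set_size(observed, ql):
    """Count the splits of each observed count among labels with totals ql."""
    # ways[t] = number of partial assignments with running label totals t
    ways = Counter({(0,) * len(ql): 1})
    for count in observed.values():
        updated = Counter()
        for totals, n in ways.items():
            for split in compositions(count, len(ql)):
                new = tuple(t + s for t, s in zip(totals, split))
                if all(x <= q for x, q in zip(new, ql)):
                    updated[new] += n
        ways = updated
    return ways[tuple(ql)]

print(consistent_set_size(observed, (3, 4, 3)))   # 2660
print(consistent_set_size(observed, (0, 10, 0)))  # 1
```

Under this definition, the counts shown earlier admit only a few thousand evaluations at (3,4,3), and exactly one at a vertex of the Q-simplex.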

Just like the possible set, we can generate random points in the consistent set. Each label's component of a point is stored as a sparse array because, in the common case, most of its entries are zero.

# A 'point' in evaluation space is a tuple of R sparse arrays of length R^N
# each. For such a small test, this sparse representation saves a lot of memory
[point for point in itertools.islice(cSet.random_set_generator(ql),3)]

[(HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 4 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>)),
 (HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>)),
 (HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>),
  HashablePoint(<Compressed Sparse Row sparse array of dtype 'int64'
  	with 3 stored elements and shape (81,)>))]

# Expanding them out is easy
[tuple(lp.toarray().reshape(-1) for lp in point) for point in itertools.islice(cSet.random_set_generator(ql),2)]

[(array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]),
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])),
 (array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))]

The correct cuboid generator and random sampler

The evaluations in the possible and consistent sets are unique: the process that generates them never creates duplicates. This is no longer the case when we marginalize the joint evaluations. One particular object of interest is the correct cuboid – the marginalization of the classifiers' joint responses down to each classifier's individual count of correct responses. There is one such cuboid for each label.

NTQR also has a generator and random sampler for the label correct cuboids from the consistent set. These are no longer unique since many evaluations marginalize to the same individual count. There are \(R\) correct cuboids, one per label, and each has dimension \(N\), the number of classifiers. This allows for visualizations of up to three classifiers at a time.

[correct_point for correct_point in itertools.islice(cSet.correct_cuboid_random_generator(ql),5)]

[((2, 0, 2, 0), (1, 1, 1, 1), (0, 2, 0, 0)),
 ((1, 1, 2, 1), (2, 1, 1, 1), (1, 1, 0, 0)),
 ((2, 1, 3, 1), (1, 3, 2, 2), (0, 2, 0, 1)),
 ((1, 0, 1, 0), (2, 1, 0, 2), (0, 1, 0, 1)),
 ((1, 0, 2, 1), (1, 2, 1, 1), (0, 2, 0, 1))]
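
The marginalization itself is simple to sketch. The function `correct_counts` below is a hypothetical helper, not NTQR's implementation: for each true label it sums the counts of every joint event in which a given classifier responded with that label. The example evaluation is hand-built from the \(Q=10\) counts shown earlier, at the Q-simplex point (3,4,3).

```python
from collections import Counter

labels = ('a', 'b', 'c')

def correct_counts(evaluation, labels, n_classifiers):
    """For each label, count how often each classifier answered it correctly."""
    cuboid_point = []
    for label in labels:
        per_classifier = [0] * n_classifiers
        for event, count in evaluation[label].items():
            for j, response in enumerate(event):
                if response == label:  # classifier j was right on these questions
                    per_classifier[j] += count
        cuboid_point.append(tuple(per_classifier))
    return tuple(cuboid_point)

# Hypothetical evaluation: true label -> counts of joint decision events.
evaluation = {
    'a': Counter({('a', 'a', 'a', 'a'): 2, ('a', 'c', 'a', 'c'): 1}),
    'b': Counter({('b', 'c', 'a', 'a'): 1, ('b', 'c', 'a', 'c'): 1,
                  ('b', 'c', 'a', 'b'): 1, ('b', 'b', 'b', 'a'): 1}),
    'c': Counter({('c', 'b', 'b', 'b'): 1, ('c', 'c', 'c', 'c'): 1,
                  ('a', 'b', 'a', 'b'): 1}),
}
print(correct_counts(evaluation, labels, 4))
# ((3, 2, 3, 2), (4, 1, 1, 1), (2, 1, 1, 1))
```

Each inner tuple has one entry per classifier, and no entry can exceed the corresponding \(Q_\ell\), which is why the points for a label fill a cuboid.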

# Change to Q=100
Q=100
answer_key = [random.choice(labels) for i in range(Q)]
label_accuracies = [{label:random.uniform(0.85,0.9) for label in labels} for classifier in classifiers]
test_results = [tuple(tlabel if label_accuracies[j][tlabel]>random.random()  
                             else random.choice([ol for ol in labels if ol != tlabel]) 
                             for j in range(len(classifiers)) )
                             for tlabel in answer_key]
counts=collections.Counter(test_results)
print(len(counts))

# With Q now large and more in line with practical values,
# we compare two points in the Q-simplex to see how position
# in the simplex affects the evaluation set counts.
pSet = ntqr.evaluations.PossibleSet(labels,classifiers)
qCenter = (30,40,30)
qbVertex = (0,100,0)
print(f"Possible set at Q-simplex centroid: {pSet.set_count_at_ql(qCenter)}")
print(f"Possible set at Q-simplex vertex: {pSet.set_count_at_ql(qbVertex)}")

Possible set at Q-simplex centroid: 80182713893557948171112790866425308957144874682523175437142614181722882883310306703324
Possible set at Q-simplex vertex: 30077383103880443506960119717599095978648262854822110
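
These two numbers can be cross-checked with the standard library. The sketch assumes that the possible set at a Q-simplex point is counted by distributing each \(Q_\ell\) over the \(R^N\) joint events independently (a stars-and-bars product); `possible_set_size` is a hypothetical helper, and whether NTQR uses this formula internally is not stated in the source.

```python
from math import comb

def possible_set_size(ql, n_events):
    """Product over labels of the ways to spread Q_label over n_events."""
    size = 1
    for q in ql:
        size *= comb(q + n_events - 1, n_events - 1)
    return size

n_events = 3 ** 4  # R^N joint decision events
center = possible_set_size((30, 40, 30), n_events)
vertex = possible_set_size((0, 100, 0), n_events)

print(len(str(center)), len(str(vertex)))  # number of decimal digits
print(center // vertex)                    # how much larger the interior point is
```

Moving from a vertex toward the interior of the Q-simplex multiplies the count by many orders of magnitude, because every label with a nonzero \(Q_\ell\) contributes its own binomial factor.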

# The consistent set should only have one member at the vertex
cSet = ntqr.evaluations.ConsistentSet(labels,classifiers,counts)
list(point for point in cSet.set_generator(qbVertex))

[(HashablePoint(<Compressed Sparse Row sparse matrix of dtype 'int64'
  	with 0 stored elements and shape (1, 81)>),
  HashablePoint(<Compressed Sparse Row sparse matrix of dtype 'int64'
  	with 28 stored elements and shape (1, 81)>),
  HashablePoint(<Compressed Sparse Row sparse matrix of dtype 'int64'
  	with 0 stored elements and shape (1, 81)>))]

# Another test is that the random generator for the ConsistentSet always
# returns the same point at a Q-simplex vertex
rand_points = list(itertools.islice(cSet.random_set_generator(qbVertex), 100))
print(cSet.is_valid_point(rand_points[33], qbVertex))
cSet.are_points_equal(rand_points[33], rand_points[34])

(True, 'Valid')

True

import time
# Evaluations in the consistent set are now sped up
# using numba JIT compiled code.
start = time.perf_counter()
rand_points = set(itertools.islice(cSet.random_set_generator((1,97,2)), 2000))
end = time.perf_counter()

print(f"Elapsed time: {end - start:.4f} seconds")

# Consistent sets are small near the Q-simplex vertex points,
# where only one label is present in the answer key
points=set(point for point in cSet.set_generator((1,97,2)))
len(points)