A high-level, conceptual explanation of NTQR logic

No one knows the ground truth in unsupervised settings. Nonetheless, how experts disagree can help us exclude possible evaluations for them. NTQR approaches this problem as a logical question: which group evaluations are logically consistent with how we observe the experts disagreeing in their decisions?

For example, if two classifiers disagree in their decisions, they cannot both be one hundred per cent correct. This is a purely logical argument for excluding a possible group evaluation (both are completely correct) based on the fact that they disagreed. Their disagreement is not logically consistent with assigning both of them a perfect evaluation.
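The exclusion argument can be sketched as a brute-force filter. This is an illustrative sketch, not part of the NTQR API: the function name `consistent_pairs` and its inputs are hypothetical. It uses one fact only: on every item where two classifiers disagree, at least one of them is wrong, so their error counts must sum to at least the number of disagreements.

```python
def consistent_pairs(test_size, disagreements):
    """Return (correct_1, correct_2) pairs not excluded by the disagreements.

    Hypothetical helper for illustration. On each item where the two
    classifiers disagree, at least one is wrong, so
    errors_1 + errors_2 >= disagreements.
    """
    if not 0 <= disagreements <= test_size:
        raise ValueError("disagreements must lie between 0 and test_size")
    pairs = []
    for correct_1 in range(test_size + 1):
        for correct_2 in range(test_size + 1):
            errors = (test_size - correct_1) + (test_size - correct_2)
            if errors >= disagreements:
                pairs.append((correct_1, correct_2))
    return pairs


# Ten items, one observed disagreement.
pairs = consistent_pairs(test_size=10, disagreements=1)
print((10, 10) in pairs)  # both perfect: excluded by the disagreement
print((10, 9) in pairs)   # one error between them: still possible
```

The inequality is only a necessary condition, so this filter may keep pairs that a fuller analysis would also exclude.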

This simple example shows how we can create a logic of unsupervised evaluation that is universal and useful. It is universal because it has no semantics about the classification task and no internal knowledge of how the classifiers operate. The two experts disagreeing in the example above could be humans or robots; it does not matter.

The only inputs used by the algorithms in NTQR are the observed counts of how classifiers agreed and disagreed when labeling items. There are \(R^N\) ways that N classifiers can agree or disagree across R labels. A classification test can thus be summarized by the observed counts of these events. The total of the event counts is, by construction, equal to the size of the test, \(Q\).
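As a minimal sketch of this summary, the snippet below counts joint decision patterns with the Python standard library. The function name `count_events` and the sample decisions are made up for illustration and are not NTQR's own data format.

```python
from collections import Counter
from itertools import product


def count_events(decisions, labels):
    """Count each joint decision pattern across aligned classifier outputs.

    decisions: one sequence of labels per classifier, all the same length.
    Returns a dict with one entry per possible pattern (R**N in total),
    including patterns that were never observed.
    """
    observed = Counter(zip(*decisions))
    n_classifiers = len(decisions)
    return {event: observed.get(event, 0)
            for event in product(labels, repeat=n_classifiers)}


# Made-up decisions of N = 2 classifiers on Q = 5 items with R = 2 labels.
decisions = [
    ["a", "a", "b", "b", "a"],
    ["a", "b", "b", "a", "a"],
]
counts = count_events(decisions, labels=("a", "b"))
print(counts)
print(len(counts), sum(counts.values()))  # R**N events, total equals Q
```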

By talking about counts of events that we label arbitrarily, we have stripped the test of any semantic information. All we are left with are the counts of the classifiers' agreements and disagreements on a finite set of responses. This makes the counting logic universal: it applies to any classification test.

If we knew how these event counts were partitioned across the true labels, we would have enough information to calculate the classifiers' average rates of correct and incorrect decisions. For example, we could calculate average label accuracy for any classifier by marginalizing out the other ones.
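The marginalization step can be illustrated with a small sketch. It assumes a hypothetical layout in which the event counts are already split by true label, which is exactly the information that is unavailable in an unsupervised setting; the numbers are invented.

```python
from fractions import Fraction


def label_accuracy(counts_by_true_label, classifier, label):
    """Accuracy of one classifier on items whose true label is `label`.

    counts_by_true_label maps a true label to {decision pattern: count}.
    Summing over the other classifiers' decisions marginalizes them out.
    """
    events = counts_by_true_label[label]
    total = sum(events.values())
    correct = sum(count for event, count in events.items()
                  if event[classifier] == label)
    return Fraction(correct, total)


# Made-up partition for two classifiers on a test with 6 'a' and 4 'b' items.
counts_by_true_label = {
    "a": {("a", "a"): 3, ("a", "b"): 2, ("b", "a"): 1, ("b", "b"): 0},
    "b": {("a", "a"): 0, ("a", "b"): 1, ("b", "a"): 1, ("b", "b"): 2},
}
print(label_accuracy(counts_by_true_label, classifier=0, label="a"))
print(label_accuracy(counts_by_true_label, classifier=1, label="b"))
```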

This logic is much easier to formalize because it is guaranteed to be complete in any domain. Consider the unknown answer key to a classification test. We can represent any such key as a tuple of the number of times each label appears in it. This maps any possible answer key to an integer point in an R-dimensional space. Because the counts must sum to the test size Q, there are only finitely many such points, and they are complete: they are guaranteed to contain any possible answer key.
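To make the finite set of answer key points concrete, here is a small sketch that enumerates them. The generator name `answer_key_points` is hypothetical. The count is checked against the standard stars-and-bars formula \(\binom{Q+R-1}{R-1}\).

```python
from math import comb


def answer_key_points(test_size, n_labels):
    """Yield every tuple of n_labels non-negative integers summing to test_size."""
    if n_labels == 1:
        yield (test_size,)
        return
    for first in range(test_size + 1):
        for rest in answer_key_points(test_size - first, n_labels - 1):
            yield (first,) + rest


points = list(answer_key_points(test_size=4, n_labels=3))
print(len(points), comb(4 + 3 - 1, 3 - 1))  # both follow stars and bars
print((1, 2, 1) in points)  # a key with 1 'a', 2 'b' and 1 'c' item
```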

Because of this completeness, we can trade uncertainty about the world models that experts may have for uncertainty about how to evaluate their decisions. The answer key simplex, the set of possible label-count tuples, is the same for every test of size Q with R labels.

This is useful, but one must be careful not to ascribe magical powers to purely logical arguments. We are trading domain verification for test verification. Logic alone cannot do this validation. Even something as simple as the size of the test needs to be validated as appropriate in any given monitoring application.

The NTQR package contains the linear algebra algorithms that allow you to calculate the logically consistent set, which is a subset of all possible grades. These sets can be represented by the equations that define them. All possible grades obey the simplex and marginalization equations. The logically consistent set contains those grades that also obey the observable equations.
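For intuition only, the sketch below brute-forces the consistent set for the smallest case: one binary classifier with labels 'a' and 'b'. It is not NTQR's linear algebra algorithm or API, and `consistent_grades` is a hypothetical name. The observable assumed here is the number of 'a' responses, which must equal the classifier's correct decisions on 'a' items plus its wrong decisions on 'b' items.

```python
def consistent_grades(test_size, a_responses):
    """Brute-force the grades of one binary classifier that match its responses.

    A grade is (q_a, q_b, correct_a, correct_b): the answer key counts and
    the classifier's correct decisions on each label. Illustrative only.
    """
    grades = []
    for q_a in range(test_size + 1):
        q_b = test_size - q_a  # answer key counts sum to the test size
        for correct_a in range(q_a + 1):
            for correct_b in range(q_b + 1):
                # 'a' responses = correct on 'a' items + wrong on 'b' items
                if correct_a + (q_b - correct_b) == a_responses:
                    grades.append((q_a, q_b, correct_a, correct_b))
    return grades


# Every grade on a test of size 4, before looking at any responses.
all_grades = sum((q_a + 1) * (4 - q_a + 1) for q_a in range(5))
grades = consistent_grades(test_size=4, a_responses=3)
print(len(grades), "of", all_grades, "grades remain")
print((4, 0, 4, 0) in grades)  # perfect on an all-'a' key: excluded
```

Enumeration like this grows quickly with test size and the number of classifiers, which is one reason to work with the defining equations instead.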