What is Krippendorff's α and why do we use it in SuperLim?

The short answer

Krippendorff's α is a score between -1 and 1, where 1 is the score for a system that answers every item in a test set correctly, and a score around 0 means that the system performs at around chance level. Clearly negative scores mean that the system systematically gives incorrect answers.

SuperLim 2.0 uses Krippendorff's α as its default evaluation measure because it allows us to use a single measure for as many tasks as possible: it can be applied to balanced and unbalanced test data sets, to labelling tasks with two or more classes, as well as to scoring tasks.

The longer answer

When we evaluate a system on a task, we compare the system's prediction for each item in the test set to the gold standard label that the test set creators have provided. The best way to learn about the strengths and weaknesses of a system is to look at this comparison from as many angles as possible. We can, for instance, calculate scores like precision, recall and accuracy for labelling tasks, or some kind of correlation for a scoring task or when we have ordered labels. These may all tell different stories, as they give us different perspectives on system performance. Zooming out, we can calculate averages over such evaluation measures, like f-score, or even averages over averages like macro-averaged f-score. This hides some details, but it also allows us to put a single number on the system's performance as a whole. At the other extreme, zooming in, we can do error analysis and study subsets of the data or even individual predictions. All of these things will help us judge and understand a system's behaviour better.

There isn't a single measure that will tell us everything – but there are contexts in which it is convenient to have one score to attach to a system. SuperLim is such a context, because we want to be able to compare many systems at a glance. For the SuperLim leaderboard, it is even the case that we would like to be able to create an average score over multiple tasks, so that we can arrive at a ranking of systems. Ideally, then, the scores we average over are of the same kind. So the question is: is there a single measure we can use on as many of the SuperLim tasks as possible?

Without going into too much detail, let us look at the kind of considerations that come into play when picking a single measure by discussing a couple of options:

  • Accuracy, the proportion of cases in which the system predicts what the gold standard says, is well-known, simple and extremely intuitive. It is, however, also known that even poor systems easily get high accuracy scores when the test data is skewed, that is, when one of the labels occurs significantly more often than the other(s); the sketch after this list illustrates this.
  • F-score for one label is the harmonic average of precision (how many of the cases predicted by the system are actually correct?) and recall (how many of the actual cases are also predicted by the system?) with respect to that label. It can be used on a task with any number of labels and it is often used in skewed datasets. However, the reason it works well on skewed datasets is also its weakness, as it basically ignores all the other labels in the dataset. Even in a task with only two labels, f-score does not signal how good the system is at predicting the label that is not in focus.
  • Macro-averaged f-score is the mean of individual f-scores for all different labels. This solves the problem of focusing on one label, and it is a common strategy when evaluating tasks with three labels or more. For the same reasons, it is also meaningful to do so in a two-label task, even though it is rarely done. (As an aside, it turns out macro-averaged f-score of a binary task is a linear transformation of Krippendorff's α for that task.)
  • The options discussed above are all grounded in checking for exact matches between predictions and gold standard labels. But for tasks that have ordered labels (ordinal scale) or where the system needs to predict a score (interval or ratio scale), we miss information when we do this. In these tasks, we care more about the distance between predictions and gold standard scores. It is common to use correlation measures like Pearson's and Spearman's coefficients. The disadvantage of these is that they ignore the sizes of the predicted scores themselves, and only care about the extent to which they are (linearly or monotonically) related to the gold standard scores. These coefficients are therefore the equivalent of plotting predictions and gold standard scores in a scatterplot and studying the shape of the point cloud without bothering to write down the actual scores on the axes.
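To see the first three points in action, here is a small illustration: a trivial system that always predicts the majority label on a skewed binary test set. The data and label names are made up, and scikit-learn is assumed to be available; it is not part of the SuperLim tooling.

```python
# Made-up skewed test set: 90 % "neg", 10 % "pos".
from sklearn.metrics import accuracy_score, f1_score

gold = ["neg"] * 90 + ["pos"] * 10
pred = ["neg"] * 100    # a trivial system that always answers "neg"

print(accuracy_score(gold, pred))                   # 0.9 -- looks impressive
print(f1_score(gold, pred, labels=["neg", "pos"],
               average=None, zero_division=0))      # per-label f-scores: ~0.95 and 0.0
print(f1_score(gold, pred, labels=["neg", "pos"],
               average="macro", zero_division=0))   # macro-averaged f-score: ~0.47
```

The high accuracy and the f-score for "neg" hide the fact that the system never predicts "pos"; the macro-averaged f-score makes this visible.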

Krippendorff's α addresses all of these points: it is applicable to balanced as well as skewed datasets; it can be used in labelling tasks with any number of labels and is sensitive to performance on all of the labels; α can be defined for labelling tasks (nominal or ordinal scale) as well as scoring tasks (interval or ratio scale); and in the case of the latter it is sensitive to the actual scores, not just the shape of their distribution.

The price we pay for this versatility is twofold. First, Krippendorff's α is not a common measure in NLP evaluation, so people will generally not be familiar with it. Second, the definition and calculation of α in the general case are less transparent than for any of the nominal measures above, though they are roughly on the same level as for the correlation measures. In SuperLim, we address part of this latter point by supplying a standard evaluation script with an implementation of Krippendorff's α.

Krippendorff's α

Krippendorff's α originates from the domain of content analysis, where it is used to quantify the level of agreement between two annotators (or: coders ). It is regularly encountered in NLP annotation creation projects, too, to estimate the reliability of the annotation. Krippendorff's α takes values between -1 and 1, where 1 indicates perfect agreement between annotators and values around 0 indicate the lack of agreement above chance level. Clearly negative values for α indicate systematic disagreement. When we apply α to evaluation, what we do is pretend the system predictions are one set of annotations, and the gold standard is another. The level of agreement between these two is our measure of performance of the system.
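To make this concrete: one way to compute α in this two-annotator view is with the third-party `krippendorff` Python package (this is just an illustration, not the SuperLim evaluation script; the integer-coded labels are made up).

```python
# Gold standard and system output treated as two "annotators".
# Assumes the third-party package `krippendorff` (pip install krippendorff).
import numpy as np
import krippendorff

gold = [0, 1, 1, 2, 0, 2]   # gold-standard labels for six items, integer-coded
pred = [0, 1, 2, 2, 0, 1]   # system predictions for the same six items

reliability_data = np.array([gold, pred])   # rows = "annotators", columns = items
print(krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```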

Krippendorff's α is parameterized by a function that measures the distance between two annotations for a given item. For instance, in the case of comparing unordered labels, this function simply returns 0 if the labels are the same and 1 if they are not; in the case of comparing scores, it might return the (squared) difference between the scores, in absolute or even in relative terms. For the different tasks in SuperLim, we use two different parameterizations (a code sketch of both follows below):

Nominal for _labelling tasks_
We use _nominal α_ for the tasks of **dalaj-ged-superlim**, **swewic**, **swewinogender** and **swewinograd** (2 labels) and **argumentation-sentences**, **swediagnostics** and **swenli** (3 labels). Note that on the binary tasks, nominal α is equivalent to macro-averaged f-score scaled to [-1,1].
Interval for _scoring tasks_
We use _interval α_ for the tasks **absabank-imm**, **supersim-superlim-relatedness**, **supersim-superlim-similarity** and **sweparaphrase**. These tasks all involve (averaged) scores on a numerical scale, provided by human annotators. Unlike a correlation measure, the agreement measure interval α _is_ sensitive to the absolute difference between a predicted score and a gold score. A good system therefore not only gets the overall shape of the relation between the scores right, but also puts the predicted scores numerically close to the gold standard scores.
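As a minimal sketch of how such a parameterized α can be computed in the two-annotator setting (gold standard versus system predictions, no missing data), the following Python function follows the standard coincidence-based definition; the actual SuperLim evaluation script may be organized differently.

```python
from collections import Counter

def alpha(gold, pred, distance):
    """Krippendorff's alpha for two complete 'coders' (gold vs. predictions),
    parameterized by a squared-distance function over pairs of values."""
    n = 2 * len(gold)                          # total number of paired values

    # Coincidence counts: each item contributes the ordered pairs (g, p) and (p, g).
    coincidences = Counter()
    for g, p in zip(gold, pred):
        coincidences[(g, p)] += 1
        coincidences[(p, g)] += 1

    # Marginal counts over the pooled values of both "coders".
    marginals = Counter(gold) + Counter(pred)

    observed = sum(count * distance(c, k) for (c, k), count in coincidences.items())
    expected = sum(marginals[c] * marginals[k] * distance(c, k)
                   for c in marginals for k in marginals)

    if expected == 0:                          # every value identical: alpha undefined
        return float("nan")
    return 1.0 - (n - 1) * observed / expected

# The two distance functions used for the SuperLim parameterizations:
nominal = lambda c, k: 0.0 if c == k else 1.0   # unordered labels
interval = lambda c, k: (c - k) ** 2            # numeric scores
```

With this sketch, `alpha(gold_labels, predicted_labels, nominal)` evaluates a labelling task and `alpha(gold_scores, predicted_scores, interval)` a scoring task.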

Pseudo-α

The tasks sweanalogy, swefaq and swesat-synonyms are what we call selection tasks. To perform them, the system has to pick the correct item. This could be an item from a small set, such as in a multiple-choice task, or it could be an item from a large or even infinite set, such as in a word production task, where the word comes from a large or open vocabulary. We evaluate system performance by checking whether the system selected the item that is given in the gold standard. The result is a type of accuracy.

These tasks are not conceptually compatible with Krippendorff's α. However, we can reformulate the task after the fact so that we may use nominal α to evaluate. Instead of picking a single item, we pretend the system has made a number of binary predictions: "no" for each item it didn't pick and "yes" for the item it did pick. This is a binary labelling task that we can evaluate using α. What is more, we do not need to actually perform the reformulated task, since we can calculate α on the binary reformulation from the accuracy on the original task. Because of this implicit task reformulation, we call the resulting score pseudo-α.
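To make the reformulation explicit (this is an illustration, not necessarily the exact SuperLim definition), the following sketch constructs the binary labels and reuses the `alpha` and `nominal` helpers from the sketch above; it assumes a constant number of candidates per item.

```python
def pseudo_alpha(correct, n_choices):
    """Nominal alpha on the binary reformulation of a selection task.
    `correct` is a list of booleans (did the system pick the gold item?),
    `n_choices` the number of candidates per item (assumed constant)."""
    gold, pred = [], []
    for is_correct in correct:
        gold += ["yes"] + ["no"] * (n_choices - 1)   # exactly one gold "yes"
        if is_correct:
            pred += ["yes"] + ["no"] * (n_choices - 1)        # identical labels
        else:
            pred += ["no", "yes"] + ["no"] * (n_choices - 2)  # two mismatches
    return alpha(gold, pred, nominal)   # nominal alpha from the sketch above
```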

The specific relation between accuracy and pseudo-α depends on the number of items to choose from, which differs between the tasks. This is why these tasks come with their own definitions of pseudo-α.
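For instance, with the hypothetical `pseudo_alpha` sketch above, the same accuracy maps to different pseudo-α values depending on how many candidates there are to choose from:

```python
correct = [True] * 80 + [False] * 20          # 80 % accuracy on 100 made-up items
print(pseudo_alpha(correct, n_choices=4))     # e.g. a four-way multiple-choice task
print(pseudo_alpha(correct, n_choices=100))   # selection from 100 candidates
```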

No α

The summarization task swedn is evaluated by comparing the system-generated summaries to the gold standard summaries. This is done using the ROUGE-1 measure. We are not aware of a way of formulating this measure or something similar in terms of Krippendorff's α, and given the different nature of this task compared to the others in SuperLim, we will not currently attempt to find one.
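As a point of reference only (the SuperLim evaluation may use a different ROUGE implementation and preprocessing), ROUGE-1 between a system summary and a gold summary can be computed with, for example, the third-party `rouge-score` package; the summaries below are made-up placeholders.

```python
# Assumes the third-party package `rouge-score` (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
scores = scorer.score("a made-up gold standard summary",   # reference
                      "a made-up system summary")          # prediction
print(scores["rouge1"].fmeasure)                           # ROUGE-1 F-measure
```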

Further reading