How To Read a Board

SuperLim is a collection of natural language understanding tasks that help estimate the potential performance of a pre-trained language model. While it is great for model developers to see their model reach the top of the overall leaderboard, that model might not be the best fit for a given user's needs. Users often have hardware limitations that restrict their choice to models up to a certain size, or they are interested in only one particular task, whose performance is best predicted by one or more specific sub-tasks in SuperLim.

We therefore advocate for personalized leaderboards that filter out irrelevant tasks and models and sort with respect to any remaining column.

Examples

These examples give an idea of how different users might view the leaderboard and find their personal "best" model.

select models that can be run on a standard GPU
deselect all word-level tasks (SuperSim, SweAnalogy, Swesat Synonyms, SweWic)
select all sentence-pair tasks (SweDiagnostics, SweMNLI, SweParaphrase, SweWic, SweWinogender)
select all encoder-decoder models
switch between results on the development/validation and the test split
any combination of the above
...
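The filtering and sorting described in these examples can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual implementation: the Row class, the params_millions column, the personalize function, the model names, and all scores are hypothetical placeholders; only the task names come from the list above.

```python
from dataclasses import dataclass

# Word-level tasks, as named in the examples above.
WORD_LEVEL = {"SuperSim", "SweAnalogy", "Swesat Synonyms", "SweWic"}


@dataclass
class Row:
    model: str            # hypothetical model name
    params_millions: int  # hypothetical size column used as a hardware proxy
    scores: dict          # task name -> score (placeholder values only)


def personalize(rows, max_params=None, drop_tasks=frozenset(), sort_by=None):
    """Return a filtered copy of the leaderboard, optionally sorted by one task."""
    # Keep only models that fit the size limit (if one is given).
    kept = [r for r in rows
            if max_params is None or r.params_millions <= max_params]
    # Remove the deselected task columns from every remaining row.
    view = [
        Row(r.model, r.params_millions,
            {t: s for t, s in r.scores.items() if t not in drop_tasks})
        for r in kept
    ]
    # Sort best-first on the chosen column; rows without that score go last.
    if sort_by is not None:
        view.sort(key=lambda r: r.scores.get(sort_by, float("-inf")),
                  reverse=True)
    return view


# Made-up rows, purely to exercise the function (not real results).
rows = [
    Row("model-a", 110, {"SweMNLI": 0.50, "SuperSim": 0.70}),
    Row("model-b", 340, {"SweMNLI": 0.60, "SuperSim": 0.40}),
    Row("model-c", 7000, {"SweMNLI": 0.80, "SuperSim": 0.90}),
]

view = personalize(rows, max_params=1000, drop_tasks=WORD_LEVEL,
                   sort_by="SweMNLI")
print([r.model for r in view])
```

With these placeholder rows, the largest model is filtered out by the size limit, the word-level column is dropped, and the remaining two models are ordered by their SweMNLI score, so the overall leader is no longer the personal "best" model.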

What Is Krippendorff's Alpha and Why Do We Use It?

The short answer

Krippendorff's α is a score between -1 and 1. A score of 1 means that the system answers every item in the test set correctly, and a score around 0 means that the system performs at about chance level. Clearly negative scores mean that the system systematically gives incorrect answers.

SuperLim 2.0 uses Krippendorff's α as its default evaluation measure because it allows us to use a single measure for as many tasks as possible: it can be applied to balanced and unbalanced test data sets, to labelling tasks with two or more classes, and to scoring tasks.
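To make the behaviour of the score concrete, here is a minimal sketch of Krippendorff's α for labelling tasks (nominal data). It assumes that the gold labels and the system's answers are treated as two "coders" with no missing values; the function name is hypothetical, and SuperLim's own scoring code may differ, in particular for scoring tasks, which need a different distance metric. The formula used is α = 1 - D_o / D_e, where D_o is the observed and D_e the expected disagreement.

```python
from collections import Counter


def krippendorff_alpha_nominal(gold, predicted):
    """Krippendorff's alpha for two label sequences, nominal metric.

    Assumption: gold labels and system output act as two coders,
    and every item has exactly one label from each.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    n = 2 * len(gold)  # total number of labels across both coders

    # Each mismatching item contributes two ordered disagreeing pairs.
    observed = 2 * sum(g != p for g, p in zip(gold, predicted))

    # Pooled label frequencies n_c over both coders.
    counts = Counter(gold) + Counter(predicted)

    # Sum over c != k of n_c * n_k, written as n^2 - sum of n_c^2.
    expected = n * n - sum(v * v for v in counts.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")

    # alpha = 1 - D_o / D_e with D_o = observed / n
    # and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected


gold = ["a", "b", "a", "b"]
perfect = krippendorff_alpha_nominal(gold, ["a", "b", "a", "b"])
inverted = krippendorff_alpha_nominal(gold, ["b", "a", "b", "a"])
print(perfect, inverted)
```

On this toy input, the system that reproduces every gold label gets α = 1, while the system that always picks the other label gets a clearly negative score.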

Read the longer answer here