Algorithmic bias

Evaluating the Prediction Bias Induced by Label Imbalance in Multi-label Classification

Prediction bias is a well-known problem in classification: algorithms tend to be skewed towards the more represented classes. This phenomenon is even more pronounced in multi-label scenarios, where the number of underrepresented classes is usually larger. In light of this, we present a novel measure that assesses the bias induced by label imbalance in multi-label classification.

In a CIKM 2021 paper, co-authored with Luca Piras and Guilherme Ramos, we propose the Prediction Bias Coefficient (PBC), an indicator that measures the strength of the correlation between the label imbalance in a dataset and the performance obtained by a multi-label classifier trained on that dataset.

The approach exploits Spearman’s rank correlation coefficient between the label frequencies and the F-scores obtained individually for each label. In terms of algorithmic fairness, the PBC can be interpreted as capturing the intensity of the classifier’s discrimination against minority classes.

The Prediction Bias Coefficient

Our coefficient is built by considering two main quantities:

  • The frequency of a label 𝑙 (freq), estimated as the proportion of label lists in the training set that contain 𝑙;
  • The binary F-score (F), computed separately for each label 𝑙 by comparing, with respect to the presence of 𝑙, the ground-truth label lists against the predicted ones.

The Prediction Bias Coefficient (PBC) is calculated as Spearman’s rank correlation coefficient between the variables 𝑓𝑟𝑒𝑞 and 𝐹. You can refer to our paper for its exact mathematical formulation.
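To make this concrete, here is a minimal sketch of the computation, assuming the labels are encoded as binary indicator matrices (as produced, for example, by scikit-learn’s MultiLabelBinarizer); the function and variable names are illustrative, and the exact formulation is the one given in the paper.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import f1_score

    def prediction_bias_coefficient(Y_train, Y_true, Y_pred):
        """PBC sketch. Y_train, Y_true, Y_pred are (n_samples, n_labels)
        binary indicator arrays; Y_true and Y_pred refer to the test set."""
        # freq: proportion of training label lists that contain each label
        freq = np.asarray(Y_train).mean(axis=0)
        # F: binary F-score computed separately for each label
        f_scores = f1_score(Y_true, Y_pred, average=None, zero_division=0)
        # PBC: Spearman's rank correlation between freq and F
        rho, _ = spearmanr(freq, f_scores)
        return rho

A PBC close to +1 means that per-label performance closely tracks label frequency, i.e. a strong bias against minority labels, while values near 0 indicate little correlation between the two.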

Results

The datasets employed in our analysis are Reuters-21578, consisting of news stories tagged with a list of economic categories, and Webscope-R4, which includes a corpus of movie synopses associated with one or more genres.

While the paper contains the detailed results, the most relevant finding is that the dataset with the higher degree of imbalance (namely, Webscope-R4) is also the one on which the classifier achieves lower balanced accuracy and macro-averaged F-score, together with a higher PBC. This suggests that the PBC can capture the relation between the degree of imbalance and the performance. The scatter plot in the following figure, which illustrates the correlation between label frequencies and label-wise F-scores, shows that, in Webscope-R4, the macro-averaged F-score is dragged down by the very poor accuracy obtained on severely underrepresented labels (visible in the bottom-left corner of the figure).
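For readers who want to reproduce this kind of visual inspection on their own data, a plot with one point per label can be drawn in a few lines; freq and f_scores are assumed to be computed as in the sketch above, and the logarithmic x-axis is our own choice to spread out the rare labels.

    import matplotlib.pyplot as plt

    # One point per label: training frequency vs. binary F-score
    plt.scatter(freq, f_scores, s=12)
    plt.xscale("log")  # assumption: a log scale spreads out the rare labels
    plt.xlabel("Label frequency in the training set")
    plt.ylabel("Per-label binary F-score")
    plt.title("Label frequency vs. label-wise F-score")
    plt.show()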