Connecting user and item perspectives in popularity debiasing for collaborative recommendation

Both the probability that an item is recommended and the probability that this recommendation succeeds are biased by item popularity. By minimizing the correlation between a positive user-item interaction and the item’s popularity, we can mitigate popularity bias. Less popular items can then be recommended without affecting recommendation effectiveness, and with a positive effect on beyond-accuracy perspectives (novelty, coverage, and diversity).

In a paper published in the Information Processing & Management journal (Elsevier), with Gianni Fenu and Mirko Marras, we characterize popularity bias in item recommendation and mitigate it in state-of-the-art models with a novel strategy that combines pre- and in-processing interventions.

Popularity bias assessment

To assess popularity bias, we introduce two metrics:

Item Statistical Parity (ISP) measures to what extent recommendations are equally distributed across items, regardless of their popularity. If the recommendations are perfectly equally distributed across items, then ISP = 1 and statistical parity is met. ISP decreases toward 0 as the distribution of recommendations becomes more unequal.
Item Equal Opportunity (IEO) measures to what extent the true positive rates of different items are the same. If every item has the same chance of being recommended when it is known to be of interest, then IEO = 1. Conversely, IEO decreases toward 0 when the probability of being recommended is high for only a few items of interest in the test set.

We assessed the behavior of these two metrics in the movie and education domains, using the MovieLens-1M and COCO datasets, and studied four models, namely Random, MostPop, NeuMF, and BPR. As the following figures show, no model achieves ISP and IEO values close to 1, with the exception of Random, which trivially recommends items without accounting for their popularity because it picks them at random (nevertheless, as we show in the paper, its recommendation effectiveness is very low).

Mitigation

To mitigate popularity bias, we propose a pre- and in-processing approach that follows the pipeline in the figure.

The two main steps work as follows:

Training Examples Mining (sam). Under a point-wise optimization setting, t unobserved user-item pairs are created for each observed user-item interaction. The observed interaction is replicated t times so that our correlation-based regularization can work. Under a pair-wise optimization setting, instead, t triplets are generated for each user u and each of their observed user-item interactions. In both settings, given the observed item i, the unobserved item j is selected among the items less popular than i for t/2 training examples, and among the items more popular than i for the other t/2 examples.
Regularized Optimization (reg). The training examples are fed into the original recommendation model in batches, to perform an iterated stochastic gradient descent. Regardless of the family of the algorithm, the optimization approach follows a regularized paradigm derived from the original point- and pair-wise optimization functions. The original paper contains the details of our regularization.
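The two steps above can be sketched as follows for the point-wise setting. This is an illustration under stated assumptions, not the paper's implementation: the function names, the fallback used when no less (or more) popular unobserved item exists, the choice of Pearson correlation, the quantity being correlated (`scores`), and the additive `weight` term are all hypothetical. The original paper contains the actual regularized objective.

```python
import math
import random


def mine_pointwise_examples(user, item, popularity, observed, t, rng):
    """Build t positive copies and up to t negatives for one observed pair.

    popularity: {item: interaction count}; observed: items the user has seen.
    Half of the negatives come from items less popular than `item`,
    the other half from items more popular than it.
    """
    unseen = [j for j in popularity if j not in observed and j != item]
    less = [j for j in unseen if popularity[j] < popularity[item]]
    more = [j for j in unseen if popularity[j] > popularity[item]]
    examples = [(user, item, 1)] * t  # replicate the observed interaction
    for pool, k in ((less, t // 2), (more, t - t // 2)):
        # Assumption: fall back to any unseen item when one side is empty.
        source = pool or unseen
        if not source:
            continue  # nothing left to sample for this user
        examples += [(user, rng.choice(source), 0) for _ in range(k)]
    return examples


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def regularized_loss(base_loss, scores, popularities, weight):
    """Base loss plus a penalty on correlation between scores and popularity.

    scores:       model outputs for the positive pairs in a batch (assumed)
    popularities: popularity of the item in each of those pairs
    weight:       hypothetical trade-off hyperparameter
    """
    return base_loss + weight * abs(pearson(scores, popularities))
```

In a real training loop the penalty would be computed with differentiable tensor operations on each batch so that gradients flow through it; the plain-Python version only shows which quantities are involved.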

Results

We evaluated our mitigation approach on the same datasets and models, against state-of-the-art baselines, also considering the impact of each component (sam and reg) on the results. While the paper contains the detailed results, here are the main take-home messages:

Combining our sampling and regularization leads to higher ISP and IEO than applying either of them separately. The loss in ranking accuracy with respect to the original model is negligible if a balanced test set is considered.
Our correlation-based regularization, jointly with the tailored sampling, leads to a reduction of the gap in relevance score of items along the popularity tail. This is stronger for pair-wise approaches and sparse datasets.
Mitigating popularity bias with our procedure has a positive impact on recommendation quality. Higher ISP and IEO, higher novelty, and a wider coverage are achieved at the cost of a small loss in NDCG, if the model is evaluated on a balanced test set.
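Several of the points above depend on a balanced test set. As a rough sketch, assuming that "balanced" means every item contributes the same number of test observations, such a set could be drawn as follows. The function name, the `per_item` parameter, and the choice to drop items with too few test observations are assumptions for illustration; the paper describes the actual protocol.

```python
import random


def balanced_test_set(test_pairs, per_item, rng):
    """Keep exactly `per_item` test observations for each retained item.

    test_pairs: list of (user, item) held-out interactions
    Assumption: items with fewer than `per_item` observations are dropped.
    """
    by_item = {}
    for user, item in test_pairs:
        by_item.setdefault(item, []).append((user, item))
    balanced = []
    for pairs in by_item.values():
        if len(pairs) >= per_item:
            # Sample without replacement so no observation is duplicated.
            balanced += rng.sample(pairs, per_item)
    return balanced
```

Evaluating on such a set prevents popular items, which dominate an ordinary held-out split, from dominating the accuracy estimate as well.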

Conclusions

Based on the results, we conclude that:

The distributions of predicted user relevance for observed head and mid items differ in a biased way; head items receive higher relevance scores, on average.
The pair-wise accuracy for observed mid items is lower than the one for observed head items; mid items are under-ranked regardless of users’ interests.
The combination of our sampling strategy and our regularized loss function leads to a lower gap in pair-wise accuracy between observed head and mid items; higher statistical parity, equal opportunity, and beyond-accuracy estimates can be achieved by the models treated with our mitigation procedure.
The treated models exhibit accuracy comparable to that of the original model when the same number of test observations is used for each item, which has been shown to be a proper testing setup when popularity bias is considered.
Compared to state-of-the-art alternatives, our treated models reduce popularity bias while achieving competitive ranking accuracy and beyond-accuracy estimates, generalizing well across users’ populations and domains.