
Rows or Columns? Minimizing Presentation Bias When Comparing Multiple Recommender Systems

Under presentation bias, users' attention to the items in a recommendation list varies with the items' position, which affects both the chance that an item is considered at all and the measured effectiveness of a model. When comparing different layouts through which recommendations are presented, presentation bias impacts users' clicking behavior (low-level feedback), but not so much the perceived performance of a recommender system (high-level feedback).

In a study conducted with Patrik Dokoupil and Ladislav Peska, and published in the proceedings of SIGIR ’23, we investigate whether users' perception of recommender systems changes when the results are presented in an advantaged or disadvantaged position. To do so, we adopt a page-wise within-subject design (PageWS), where the evaluated options are different recommendation lists, displayed in parallel within a single page.

PageWS reduces the carry-over effect and simplifies the relative comparison, since the different options are displayed side by side. However, due to the uneven distribution of users' attention over different page regions, PageWS is affected by presentation bias [2]. In other words, in the comparison there is always an algorithm whose results are presented before those of the other and that, thus, appears in an advantaged position.

Thanks to PageWS, we evaluate two perspectives. On the one hand, we ask participants to work on a specific sub-task and also to provide high-level feedback, e.g., to evaluate the overall performance of the individual recommender systems. On the other hand, we evaluate how position bias affects actions such as clicking on items (in this work, we refer to these actions as low-level feedback).

User study design

The study was conducted on top of the MovieLens-Latest dataset [8]. As a pre-processing step, user feedback was binarized, only recent movies and ratings were kept, and movies and users with an insufficient volume of ratings were removed (see the supplementary materials for details). The final dataset contained 9K users, 2K movies, and 1.5M ratings.
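For illustration, here is a minimal sketch of such a pre-processing pipeline in pandas. The thresholds (positive-rating cutoff, recency cutoff, minimum rating counts) are hypothetical placeholders rather than the exact values used in the study, and the sketch assumes a pre-extracted year column; see the supplementary materials for the actual settings.

```python
import pandas as pd

# Hypothetical thresholds -- the actual values are in the supplementary materials.
POSITIVE_THRESHOLD = 4.0    # ratings >= threshold become positive feedback
MIN_YEAR = 2010             # keep only "recent" movies
MIN_RATINGS_PER_USER = 50   # drop users with too few ratings
MIN_RATINGS_PER_MOVIE = 50  # drop movies with too few ratings

ratings = pd.read_csv("ratings.csv")  # userId, movieId, rating, timestamp
movies = pd.read_csv("movies.csv")    # movieId, title, ..., year (assumed pre-extracted)

# Binarize explicit ratings into positive/negative feedback.
ratings["feedback"] = (ratings["rating"] >= POSITIVE_THRESHOLD).astype(int)

# Keep only recent movies (and the ratings that refer to them).
recent_movies = movies.loc[movies["year"] >= MIN_YEAR, "movieId"]
ratings = ratings[ratings["movieId"].isin(recent_movies)]

# Iteratively drop users and movies with insufficient rating volume,
# since removing one side can push the other below its threshold.
while True:
    user_counts = ratings.groupby("userId").size()
    movie_counts = ratings.groupby("movieId").size()
    keep_users = user_counts[user_counts >= MIN_RATINGS_PER_USER].index
    keep_movies = movie_counts[movie_counts >= MIN_RATINGS_PER_MOVIE].index
    filtered = ratings[
        ratings["userId"].isin(keep_users) & ratings["movieId"].isin(keep_movies)
    ]
    if len(filtered) == len(ratings):
        break
    ratings = filtered
```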

We used two recommenders in the study, denoted as Delta and Gamma. Gamma follows the generalized matrix factorization example from tf.recommenders, with an embedding dimension of 32 and 5 training epochs. Delta is a multi-objective RS that uses Gamma as its relevance component while also adding novelty and diversity objectives. In particular, we used an incremental weighted average with uniform importance scores for all three objectives.
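To make the idea concrete, here is a minimal sketch of a greedy re-ranker that combines the three objectives via a uniform weighted average, where diversity is computed incrementally against the items selected so far. The `relevance` and `novelty` inputs and the cosine-style distance are illustrative assumptions (embeddings are assumed L2-normalized); the exact formulation follows the paper, not this sketch.

```python
import numpy as np

def rerank(candidates, relevance, novelty, item_embeddings, k=10,
           weights=(1 / 3, 1 / 3, 1 / 3)):
    """Greedy multi-objective re-ranking: at each step, pick the candidate
    maximizing a weighted average of relevance, novelty, and the incremental
    diversity w.r.t. the already selected items (uniform weights here)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            # Incremental diversity: mean cosine distance to the items selected
            # so far (1.0 for the first pick, when there is nothing to compare to).
            if selected:
                dists = [1.0 - float(np.dot(item_embeddings[item], item_embeddings[s]))
                         for s in selected]
                div = float(np.mean(dists))
            else:
                div = 1.0
            w_rel, w_nov, w_div = weights
            return w_rel * relevance[item] + w_nov * novelty[item] + w_div * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```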

We experimented with six PageWS layouts in total. At a high level, the first decision is whether to stack the results of the individual RS horizontally or vertically (producing column-wise and row-wise layouts, respectively). Next, we need to decide how to display items within the area allocated to each algorithm (see the following figure). In column-wise layouts, we experimented with displaying only one item per row (Columns:single), two items per row (Columns:double), and as many items as can fit (Columns:max). In row-wise layouts, we experimented with putting all items in a single row (Rows:single), distributing the results of each RS over two rows (Rows:double), and using a fixed-size single-row display area where additional items are accessible via a scrollbar (Rows:scroll).
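For reference, the six layouts can be summarized as configurations along the two dimensions above. This mapping merely restates the description; the field names are illustrative.

```python
# Each layout = (stacking direction of the RS result lists,
#                how items are arranged within each RS's area).
LAYOUTS = {
    "Columns:single": dict(stacking="horizontal", items_per_row=1),
    "Columns:double": dict(stacking="horizontal", items_per_row=2),
    "Columns:max":    dict(stacking="horizontal", items_per_row="as many as fit"),
    "Rows:single":    dict(stacking="vertical", rows_per_rs=1),
    "Rows:double":    dict(stacking="vertical", rows_per_rs=2),
    "Rows:scroll":    dict(stacking="vertical", rows_per_rs=1, overflow="scrollbar"),
}
```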

Results

Overall, 138 participants completed the whole survey, with a median completion time of 13 minutes.

While our paper contains the detailed results, here are the main outcomes that emerge from them:

  • Presentation bias has a considerable impact on the ordering of received feedback; column-wise layouts provide a better distribution of feedback than row-wise layouts. Presentation bias also has some impact on the volumes of low-level feedback, but the differences between individual layouts are rather limited.
  • The average volumes of items selected from advantaged and disadvantaged layout slots are close to uniform for all layouts (no statistically significant differences) once selections of mutually recommended items are removed from the dataset (see the sketch below).
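As a rough illustration of the second point, the following sketch drops selections of items recommended by both systems in the same session before counting selections per slot. The schema (`session_id`, `item_id`, `slot`) and the per-session recommendation sets are hypothetical assumptions, not the study's actual data format.

```python
import pandas as pd

def slot_volumes(selections: pd.DataFrame,
                 recs_a: dict, recs_b: dict) -> pd.Series:
    """Count selections per slot (advantaged vs. disadvantaged position)
    after dropping items recommended by both systems in the same session
    (the 'mutually recommended' items). recs_a / recs_b map a session_id
    to the set of items each RS recommended in that session."""
    def is_mutual(row):
        session = row["session_id"]
        return row["item_id"] in (recs_a[session] & recs_b[session])

    filtered = selections[~selections.apply(is_mutual, axis=1)]
    return filtered.groupby("slot")["item_id"].count()
```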