
Reproducibility of Multi-Objective Reinforcement Learning Recommendation: Interplay between Effectiveness and Beyond-Accuracy Perspectives

Controlling the various objectives within Multi-Objective Recommender Systems (MORSs) remains an open challenge. While reinforcing accuracy objectives appears feasible, it is harder to control diversity and novelty individually due to their positive correlation. This raises critical questions about the effectiveness of incorporating multiple correlated objectives in MORSs and about the risks of losing control over them.

In a reproducibility study conducted with Vincenzo Paparella, Vito Walter Anelli, and Tommaso Di Noia, published in the proceedings of ACM RecSys ’23, we reproduce and extend a state-of-the-art Multi-Objective Reinforcement Learning framework in Recommender Systems, focusing on the challenges of balancing accuracy, diversity, and novelty, while also assessing the impact on algorithmic bias.

Background

We reproduced the WSDM ’22 paper by Stamenkovic et al., which presents a Scalarized Multi-Objective Reinforcement Learning (SMORL) approach for session-based recommendation. The problem is formulated as a Multi-Objective Markov Decision Process, and the approach generates relevant, diverse, and novel recommendations using a single Reinforcement Learning agent. Key components of the model include a continuous state space, a discrete action space, and a multi-objective Q-value function. The model optimizes these objectives using a generative sequence model and the Scalarized Deep Q-learning algorithm. Rewards are defined for each objective, with specific criteria for accuracy, diversity, and novelty, to guide the recommendation process effectively.
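To make the scalarization step concrete, the core idea can be sketched as a weighted sum of per-objective Q-values that collapses the multi-objective scores into a single ranking score per candidate item. The function name, weights, and Q-values below are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def scalarized_q_values(q_per_objective, weights):
    """Linear scalarization of per-objective Q-values.

    q_per_objective: shape (n_objectives, n_items), one Q-value vector
    per objective (e.g. accuracy, diversity, novelty).
    weights: shape (n_objectives,), controlling the trade-off.
    Returns one scalar score per candidate item.
    """
    q = np.asarray(q_per_objective, dtype=float)
    w = np.asarray(weights, dtype=float)
    return w @ q  # shape (n_items,)

# Hypothetical example: 3 objectives, 4 candidate items.
q = [[0.9, 0.1, 0.4, 0.2],   # accuracy
     [0.2, 0.8, 0.3, 0.5],   # diversity
     [0.1, 0.7, 0.2, 0.6]]   # novelty
scores = scalarized_q_values(q, [1.0, 0.5, 0.5])
next_item = int(np.argmax(scores))  # index of the recommended item
```

With an accuracy-dominant weight vector like this one, the item with the highest accuracy Q-value wins even though other items score better on diversity and novelty.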

Experiments

Experimental Setup

Datasets. For our study, we employ the RC15 and Retailrocket datasets, which were also used in the original study. This allows us to ensure consistency in the replication process and provides a foundation for evaluating the SMORL approach.

Baselines. Our paper compares the SMORL framework with established baselines, including GRU4Rec, Caser, NextItNet, and SASRec. This comparison assesses the effectiveness of SMORL against traditional sequential models and helps us understand its relative performance improvements and shortcomings.

Evaluation Protocol. We evaluate the models with a 5-fold cross-validation approach. As performance metrics, we measure accuracy through Hit Ratio (HR) and normalized Discounted Cumulative Gain (nDCG), and beyond-accuracy objectives via novelty, diversity (assessed through Item Coverage), and the repetitiveness of the recommendations.
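For next-item prediction with a single held-out target per session, the two accuracy metrics reduce to simple formulas. The sketch below (with made-up item IDs) is a minimal illustration, not the exact evaluation code of the study:

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """HR@k: 1 if the held-out next item appears in the top-k, else 0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """nDCG@k with a single relevant item: 1/log2(rank + 1) if the
    target is in the top-k (the ideal DCG is 1, so the discounted
    gain is already normalized)."""
    try:
        rank = ranked_items[:k].index(target) + 1
    except ValueError:
        return 0.0
    return 1.0 / math.log2(rank + 1)

ranked = [42, 7, 13, 99, 5]   # hypothetical ranked recommendation list
hr = hit_ratio_at_k(ranked, 13, 5)    # target found in top-5 -> 1.0
ndcg = ndcg_at_k(ranked, 13, 5)       # rank 3 -> 1/log2(4) = 0.5
```

The same functions averaged over all test sessions give the dataset-level HR@k and nDCG@k.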

Results and Discussion

Accuracy Objectives. SMORL generally enhances accuracy across most datasets and baseline comparisons, confirming the effectiveness of the approach in achieving this primary objective of Recommender Systems.

Beyond-Accuracy Objectives. Here, the focus shifts to the diversity and novelty aspects of the recommendations. The results are mixed, suggesting that while SMORL can improve accuracy, achieving a balance with beyond-accuracy objectives like diversity and novelty is more challenging.

Influence of Objective Weights. We explore the impact of varying weights assigned to different objectives within the SMORL framework. Results indicate the complexity of controlling multiple objectives simultaneously and how these weights can significantly affect the overall performance of the recommender system.
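To illustrate why these weights matter, here is a toy sweep (with made-up per-objective Q-values) showing how increasing the weight on the beyond-accuracy objectives changes which item is ranked first under linear scalarization:

```python
import numpy as np

# Hypothetical per-objective Q-values for three candidate items
# (rows: accuracy, diversity, novelty; columns: items).
q = np.array([[0.9, 0.5, 0.2],
              [0.1, 0.6, 0.9],
              [0.1, 0.5, 0.7]])

top_items = []
for w_beyond in (0.0, 0.5, 1.0):
    # Accuracy weight fixed at 1.0; diversity/novelty weights swept.
    weights = np.array([1.0, w_beyond, w_beyond])
    scores = weights @ q  # linear scalarization
    top_items.append(int(np.argmax(scores)))

# As the beyond-accuracy weight grows, the top recommendation shifts
# from the most accurate item toward the most diverse/novel one.
```

Even in this tiny example the ranking flips twice across three weight settings, which mirrors the sensitivity we observe at scale: small weight changes can noticeably reshape the recommendation lists.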

Assessing Algorithmic Bias

We extend the evaluation of SMORL to the algorithmic bias perspective. Since most existing algorithmic bias metrics are not time-aware, we adapt them to the sequential, session-based nature of the recommendations:

  1. Bias Disparity (BD): This metric assesses the difference between the input bias and the recommendation bias. It helps identify if the recommendation algorithm amplifies existing biases in the data source.
  2. Ranking-based Statistical Parity (RSP): RSP is based on statistical parity, which aims to equalize the ranking probability distributions across different item categories. Lower values of RSP indicate less bias, meaning that the recommendation system is treating different item categories more equally in terms of their appearance in the top rankings.
  3. Ranking-based Equal Opportunity (REO): Contrary to RSP, REO is based on the concept of equal opportunity. This metric looks at the equity of ranking probabilities, considering the items enjoyed in the sessions. It assesses whether items, especially from less popular or niche categories, get fair exposure in the recommendations.
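As a rough illustration of the first metric, Bias Disparity can be computed per item category as the relative change between a category's share in the input sessions and its share in the recommendations. The helper below is a simplified, non-time-aware sketch with hypothetical item IDs and categories, not the adapted metric used in the paper:

```python
from collections import Counter

def bias_disparity(input_items, recommended_items, categories):
    """BD per category: (recommendation bias - source bias) / source bias.

    A positive value means the recommender amplifies the category's
    share relative to the input data; a negative value means it
    suppresses that category.
    """
    def share(items):
        counts = Counter(categories[i] for i in items)
        total = len(items)
        return {c: counts[c] / total for c in counts}

    bs = share(input_items)        # bias in the data source
    br = share(recommended_items)  # bias in the recommendations
    return {c: (br.get(c, 0.0) - bs[c]) / bs[c] for c in bs}

# Hypothetical data: two popular and two niche items in the input,
# but only popular items in the recommendations.
categories = {1: "pop", 2: "pop", 3: "niche", 4: "niche"}
bd = bias_disparity([1, 2, 3, 4], [1, 2, 2, 1], categories)
# "pop" share goes 0.5 -> 1.0 (BD = +1.0); "niche" 0.5 -> 0.0 (BD = -1.0)
```

A BD of zero for every category would mean the recommender mirrors the input distribution exactly; the further from zero, the stronger the amplification or suppression.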

Results

The key findings can be summarized as follows:

  1. Impact on Bias Disparity. While SMORL can diversify recommendations, it is responsible for increasing the bias disparity in the output, particularly for certain categories of items. This indicates that while the SMORL approach can enhance certain aspects of recommendations like diversity and novelty, it may also inadvertently increase bias, especially in categories that are either less popular or have inherent biases in the dataset.
  2. Item Category and Popularity. SMORL-based models display higher bias disparity in certain item categories than the vanilla baselines, thereby reducing the exposure of items in those categories. This suggests that the SMORL framework, while effective at recommending relevant items, might contribute to lower visibility of niche or less popular items.
  3. Provider-Side Unfairness. The results also indicate a potential provider-side unfair situation, where relevant item-based rewards in SMORL could reduce the exposure of highly niche items. This is crucial from a fairness perspective, as it highlights the challenge of balancing user-centric objectives (like relevance, diversity, and novelty) with fairness towards item providers.
  4. Complexity in Controlling Objectives. The study underscores the difficulty in controlling the influence of individual objectives within the SMORL framework. This is particularly evident in the context of diversity and novelty, which are often positively correlated, making their separate optimization challenging.

Conclusions

Our study primarily focused on reproducing and extending the Scalarized Multi-Objective Reinforcement Learning (SMORL) framework by Stamenkovic et al., currently representing the state of the art in the context of multi-objective recommendation.

Our replication efforts not only validated the original framework’s effectiveness in optimizing for accuracy, diversity, and novelty in recommendations but also opened new avenues of exploration. A critical extension of our study was the in-depth analysis of algorithmic bias within the SMORL framework. We rigorously assessed how the framework, while enhancing user-centric objectives, interacts with and influences algorithmic fairness and bias, particularly in item popularity and category.

Moreover, our findings revealed the nuanced complexities in controlling multiple objectives within MORSs. We observed that advancements in certain recommendation objectives could inadvertently lead to challenges in others, especially concerning algorithmic bias. This finding underscores the delicate balance required in optimizing recommender systems for multiple goals.

Concluding our study, we highlighted the importance of reproducibility in MORS research. By successfully replicating and extending the SMORL framework, we not only reaffirmed its value in the recommender systems field but also contributed new insights into the interplay between diverse recommendation objectives and the implications on fairness and bias. This work paves the way for future research focused on creating more balanced, fair, and effective recommender systems.