Explainability · Recommender systems

A Comparative Analysis of Text-Based Explainable Recommender Systems

We reproduce and benchmark prominent text-based explainable recommender systems to test the recurring claim that hybrid retrieval-augmented approaches deliver the best overall balance between explanation quality and grounding. Yet, prior evidence is hard to compare because studies diverge in datasets, preprocessing, target explanation definitions, baselines, and evaluation metrics. Under a unified benchmark on three real-world review datasets, we find that hybrid approaches are generally strongest, but the conclusion is reliable only when retrieval and evaluation choices reflect realistic and aligned settings.

Textual explanations are one of the most direct ways to make recommender systems inspectable: they expose the rationale behind a suggestion in a form users can read, contest, and act upon. In review-driven domains, explanations are often expected to be both fluent and faithful to item evidence, which creates a tradeoff between language generation and grounded justification.

Over the last few years, three main families of solutions have emerged.

  • Generation-based models learn to produce an explanation sentence, typically jointly with recommendation-related objectives.
  • Extraction-based models select an explanation sentence from review corpora, implicitly grounding the output in observed text.
  • Hybrid approaches combine retrieval and generation, aiming to keep the flexibility of generation while anchoring it in retrieved evidence (a minimal sketch of this pattern follows the list).
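
To make the hybrid pattern concrete, here is a minimal, hypothetical retrieve-then-generate sketch. The function names (retrieve_evidence, generate_explanation), the bag-of-words scoring, and the echo-style "generator" are illustrative assumptions, not the interface of any specific model discussed below.

```python
# Minimal sketch of a hybrid retrieve-then-generate explainer.
# All names and scoring choices are illustrative assumptions.

from collections import Counter
from math import sqrt


def _bow(text: str) -> Counter:
    """Bag-of-words representation used for a toy similarity score."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_evidence(query: str, item_review_sentences: list[str], k: int = 3) -> list[str]:
    """Rank the item's review sentences against a user/item query and keep the top k."""
    q = _bow(query)
    ranked = sorted(item_review_sentences, key=lambda s: _cosine(q, _bow(s)), reverse=True)
    return ranked[:k]


def generate_explanation(user_profile: str, evidence: list[str]) -> str:
    """Placeholder for a conditional generator; here we simply echo the best evidence."""
    return evidence[0] if evidence else "No supporting evidence found."


# Usage: explain a single (user, item) pair.
reviews = [
    "The battery easily lasts two days.",
    "Screen is bright and sharp.",
    "Customer support was slow to respond.",
]
evidence = retrieve_evidence("battery life for heavy use", reviews)
print(generate_explanation("values battery life", evidence))
```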

In a study conducted in cooperation with Alejandro Ariza-Casabona and Maria Salamó, published in the Proceedings of RecSys 2024, we present a reproducibility-driven comparative analysis of text-based explainable recommender systems.

In this paper, we reproduce and align representative methods across generation-based, extraction-based, and hybrid categories. Even when source code exists, experiments often depend on undocumented preprocessing, implicit target construction choices, or incomplete artifact releases. Our contribution is to make these dependencies explicit and to evaluate how much they affect conclusions about what is robust versus fragile in text-based explainable recommendation.

Reproducibility goal and scope

Conceptually, we ask what should reappear if the literature’s category-level claims are stable: hybrid retrieval-augmented generation should provide strong overall behavior, extraction should be safest for evidence grounding, and generation should excel when rich data supports personalization. We test these expectations by running all selected methods under a shared benchmark definition, so that differences are attributable to modeling choices rather than to incompatible pipelines.

Our scope is cross-model and cross-category. We primarily rely on released artifacts, but we also reconstruct missing pieces when a method cannot be executed as described or when the repository omits steps required for comparable evaluation. We hold fixed the benchmark logic (datasets, target construction principles, evaluation rules) and document unavoidable deviations (such as incomplete graph construction code or missing retrieval components).

Reproduced methods

Generation-based models (learn to generate an explanation sentence)

  • NRT (2017): multitask neural recommender that jointly predicts preference and generates short “tip”-style explanations.
  • PETER (2021): transformer-based multitask model that generates explanations conditioned on user–item signals.
  • SEQUER (2023): sequential variant that conditions explanation generation on interaction history (next-item/sequence-aware signals).
  • POD (2023): adapts a pre-trained language model to recommendation + explanation via prompt-oriented distillation.

Extraction-based models (select an explanation sentence from observed reviews)

  • ESCOFILT (2021): builds sentence-level profiles and selects representative sentences as explanations (evidence-constrained by construction).
  • GREENer (2022): graph-based ranking of candidate sentences/attributes derived from heterogeneous user–item–text structure, with optional re-ranking to manage redundancy.

Hybrid models (combine retrieval/extraction with generation)

  • ERRA (2023): retrieves relevant attributes/sentences and generates explanations with aspect-aware conditioning to improve grounding and specificity.
  • ExBERT (2023): hybrid profile-and-retrieve setup coupled with transformer-based generation, aiming to keep fluency while anchoring content in retrieved evidence.

Methodology

Benchmark harmonization across different “explanation” definitions

A core comparability problem is that models operationalize explanations differently: some assume a single target sentence per interaction, while others naturally support multiple plausible sentences. If we evaluate them under their native assumptions, we are not reproducing the same phenomenon.

We address this by adopting a unified benchmark in which each interaction may have multiple valid target sentences, but every model outputs a single explanation sentence. This introduces a clear constraint (one output per interaction) while preserving the reality of multiple acceptable explanations, and it reduces the likelihood that a model is penalized simply for choosing a reasonable alternative sentence.
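
As an illustration of this evaluation rule, the sketch below scores one predicted sentence against all valid targets for an interaction and keeps the best match. The token-level F1 used here is only a stand-in for the benchmark's actual text metrics.

```python
# Sketch of best-match scoring: one predicted explanation, multiple valid targets.
# Token-level F1 is a stand-in for the benchmark's actual text metrics.

from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two sentences."""
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


def best_match_score(prediction: str, valid_targets: list[str]) -> float:
    """Score the single output against every acceptable target and keep the maximum."""
    return max(token_f1(prediction, target) for target in valid_targets)


targets = [
    "the battery lasts all day even with heavy use",
    "great battery life and fast charging",
]
print(best_match_score("battery life is great and lasts all day", targets))
```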

Artifact reconstruction and controlled deviation management

Several selected methods have partial or inconsistent artifacts, where crucial steps are missing or differ from the published description.

Our strategy is to preserve methodological intent when execution requires reconstruction. For example, we implement missing extraction components when a repository cannot produce explanations as described, and we rebuild graph or retrieval inputs when dataset-to-structure scripts are incomplete. We treat these interventions as part of the reproduction protocol, because otherwise the comparison would silently become a comparison of “what the repositories currently do” rather than “what the methods claim to do.”

Hallucination-aware evaluation

Standard similarity-based text metrics can reward fluent, plausible language even when it is not supported by the recommended item’s evidence. To address this, we introduce an evidence-based notion of feature hallucination: an explanation is considered to hallucinate a feature if that feature is never mentioned for the recommended item anywhere in the dataset’s review evidence.

This choice adds a grounding constraint that is directly relevant for interpretability. It allows us to separate “good-looking explanations” from “item-supported explanations,” which is essential when explanations are used as trust signals rather than as purely stylistic text.
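
A minimal sketch of this rule, assuming features have already been extracted from explanations and reviews (the feature extractor itself is out of scope here):

```python
# Sketch of the evidence-based hallucination rule: a feature in an explanation is
# counted as hallucinated if no review of the recommended item ever mentions it.
# Feature extraction is assumed to have happened upstream.

from collections import defaultdict


def build_item_feature_index(reviews: list[dict]) -> dict[str, set[str]]:
    """Map each item id to the set of features mentioned across all its reviews."""
    index: dict[str, set[str]] = defaultdict(set)
    for review in reviews:
        index[review["item_id"]].update(review["features"])
    return index


def hallucinated_features(explanation_features: set[str], item_id: str,
                          index: dict[str, set[str]]) -> set[str]:
    """Return the features in the explanation with no support in the item's reviews."""
    return explanation_features - index.get(item_id, set())


reviews = [
    {"item_id": "i42", "features": {"battery", "screen"}},
    {"item_id": "i42", "features": {"price"}},
]
index = build_item_feature_index(reviews)
print(hallucinated_features({"battery", "camera"}, "i42", index))  # {'camera'}
```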

Sensitivity checks to isolate what drives performance

Finally, we treat sensitivity analysis as part of the reproduction. We examine how outcomes change when reasonable alternative settings are applied to key mechanisms that are often under-specified in original papers: how generation models sample targets during training, whether extractive models apply diversity-oriented re-ranking when only one sentence is required, and how much hybrid performance depends on retrieval quality.
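
One lightweight way to organize such checks is an explicit grid over the under-specified choices. The setting names below are hypothetical labels for the three mechanisms just described, not the exact configurations used in the paper.

```python
# Hypothetical sensitivity grid over three under-specified pipeline choices.
# The setting names are illustrative labels, not the paper's exact configurations.

from itertools import product

SENSITIVITY_GRID = {
    "target_sampling": ["first_sentence_only", "random_valid_sentence"],
    "extractive_reranking": ["none", "diversity_mmr"],
    "retrieval_quality": ["oracle", "learned", "degraded"],
}


def run_benchmark(model_name: str, config: dict) -> None:
    """Placeholder for training and evaluating one model under one configuration."""
    print(f"{model_name}: {config}")


for values in product(*SENSITIVITY_GRID.values()):
    config = dict(zip(SENSITIVITY_GRID.keys(), values))
    run_benchmark("any_reproduced_model", config)
```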

Findings and insights

Hybrid models are generally strongest overall, but their advantage depends critically on retrieval realism and aligned evaluation choices. We consistently observe that retrieval-augmented hybrids can mitigate weaknesses seen in pure generation and pure extraction, but the “hybrid wins” conclusion is not unconditional.

Extraction-based methods are the most reliable for limiting unsupported attribute mentions under an evidence-based grounding lens. Selecting explanations from observed review text constrains ungrounded content, and this effect is especially visible when user–item signals are sparse or noisy, where purely generative approaches are more prone to defaulting to broadly plausible but weakly supported attribute language.

Generation-based models are competitive on text similarity and fluency when the domain provides enough data, but they are more sensitive to sparsity and generic-language failure modes. In richer settings, learned interaction representations can support personalized explanation content, whereas in sparser settings models more often fall back on repetitive or high-frequency phrasing that is plausible in aggregate but less diagnostic for the specific recommendation.

Popularity effects interact with hallucination behavior in non-trivial ways, which makes grounding diagnostics necessary alongside standard quality metrics. Popular features are structurally easier to “justify” because they appear for many items, but models can also over-predict them, and the resulting errors are not captured consistently by similarity-centric metrics. This separation between plausibility and evidence support is one of the main interpretive lessons of the reproduction.

Ablation-style sensitivity checks show that several reported advantages are contingent on under-specified pipeline choices rather than solely on model family. Allowing generation models to sample from multiple valid target sentences can materially change outcomes, diversity-oriented re-ranking in extraction can help or hurt depending on base ranking reliability when only one sentence is required, and weakening retrieval in hybrids can erode their apparent edge over strong generation baselines.

Conclusion

This reproducibility study strengthens confidence in several high-level claims while sharpening their conditions. Extraction-based approaches are the safest candidates when grounding and hallucination risk are central concerns, particularly in sparse domains. Generation-based models can exhibit strong personalization signals in richer settings but require careful scrutiny for repetition and over-generalization behaviors. Hybrid approaches often achieve the best overall balance, but only when retrieval is accurate and the evaluation protocol does not implicitly grant unrealistic information.

For future work, the most actionable directions are methodological. Studies should report target construction decisions, evaluation rules for multiple valid explanations, and retrieval assumptions as first-class experimental specifications. Artifact releases should include end-to-end dataset-to-input pipelines, since missing intermediate construction steps are a recurring source of hidden variability. Finally, hallucination and popularity-driven over-generalization should become routine evaluation lenses for text-based explainers, and user-centric evaluation is the natural next step once the community converges on reproducible, comparable benchmarks.