Algorithmic fairness · Recommender systems

How Fair is Your Diffusion Recommender Model?

In generative recommender systems, adopting diffusion-based learning primarily for accuracy often reproduces the biased interaction distributions present in historical logs, which results in systematic disparities for both users and items. Fairness-aware auditing, as instantiated in this study, can enable responsible diffusion recommendation by revealing when utility gains are obtained through consumer- or provider-side inequities.

We increasingly deploy recommender systems in settings where the ranking is not only an optimization artifact but also a mechanism that allocates attention, opportunity, and visibility. Hence, user-facing utility (who receives relevant recommendations) and system-facing exposure (which items receive impressions) are coupled: improving relevance for the majority can still degrade outcomes for protected user groups, and concentrating exposure can reinforce popularity dynamics that disadvantage niche providers.

Diffusion-based recommender systems are emerging as a strong generative alternative because they learn to reconstruct interaction signals from noise and can be robust to imperfect implicit feedback. Conceptually, however, generative modeling also inherits a central risk: if historical data encode structural biases, “learning the distribution well” can mean replicating (or amplifying) the same disparities. For diffusion recommender systems, this problem is particularly relevant, because the training objective is explicitly tied to recovering the observed interaction distribution.

In a study carried out in cooperation with Daniele Malitesta, Giacomo Medda, Erasmo Purificato, Mirko Marras, and Fragkiskos D. Malliaros, and published in the Proceedings of ACM RecSys 2025, we present a first empirical fairness analysis of diffusion-based recommendation, centered on DiffRec and its lighter variant L-DiffRec.

Diffusion recommenders have quickly become attractive for their utility, yet their fairness behavior in recommendation settings had not been examined with the rigor commonly applied to other model families. We therefore ask whether a pioneering diffusion recommender is fair relative to strong, widely used baselines when fairness is assessed from both consumer and provider perspectives.

High-level solution overview

We benchmark diffusion-based recommendation against a diverse set of established recommenders across two domains with available sensitive user attributes. We evaluate utility and fairness in two complementary ways.

First, we examine each dimension separately to understand whether diffusion recommenders exhibit systematic disparities for users or items when considered in isolation. Second, we adopt a multi-criteria perspective to clarify whether diffusion recommenders can simultaneously support good utility and equitable outcomes, or whether improvements in one dimension consistently come at the expense of another.

By including both DiffRec and L-DiffRec, we can also contrast two diffusion design choices (namely, operating directly over the interaction space versus operating in a clustered latent space) and interpret how these choices may relate to fairness outcomes.

Our approach

Operationalizing fairness as disparities across groups

We frame consumer fairness as parity in achieved recommendation utility across user groups defined by a sensitive attribute available in the datasets. We measure fairness as a difference in outcome: if two user groups receive meaningfully different ranking quality, the system is operationally unfair from the consumer perspective.
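To make this operational definition concrete, here is a minimal sketch of how such an outcome gap can be computed, assuming binary relevance, NDCG@k as the utility measure, and a two-group sensitive attribute; the function names and data layout are illustrative, not the exact protocol of the study.

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG@k for a single user's ranked list."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_items), k)))
    return dcg / ideal if ideal > 0 else 0.0

def consumer_utility_gap(recs, test_items, groups, k=10):
    """Absolute gap in mean NDCG@k between two user groups.

    recs:       user -> ranked list of recommended item ids
    test_items: user -> set of held-out relevant items
    groups:     user -> group label derived from the sensitive attribute
    """
    per_group = {}
    for user, ranked in recs.items():
        per_group.setdefault(groups[user], []).append(
            ndcg_at_k(ranked, test_items[user], k))
    means = {g: float(np.mean(v)) for g, v in per_group.items()}
    g0, g1 = sorted(means)
    return abs(means[g0] - means[g1])
```

A gap near zero indicates parity in achieved utility; a persistent, sizeable gap flags consumer-side unfairness in the sense used above.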

We frame provider fairness as parity in how exposure is distributed across item groups. Concretely, we consider a partition between popular items and long-tail items, and we ask whether ranked recommendations systematically concentrate visibility on the already popular side. Given that exposure is a primary mechanism through which recommenders shape market dynamics, repeated concentration can affect discovery and weaken the viability of niche or emerging items.
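A provider-side audit can be sketched analogously by tracking how much position-discounted exposure long-tail items receive in top-k lists. The split of the catalog into a popular head (here, the most-interacted 20% of items) and the logarithmic rank discount are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np
from collections import Counter

def head_items(train_interactions, head_fraction=0.2):
    """Label the most-interacted `head_fraction` of items as the popular head."""
    counts = Counter(i for items in train_interactions.values() for i in items)
    ranked = [item for item, _ in counts.most_common()]
    return set(ranked[:max(1, int(head_fraction * len(ranked)))])

def tail_exposure_share(recs, head, k=10):
    """Position-discounted share of total exposure given to long-tail items."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # higher ranks count more
    tail_exp = total_exp = 0.0
    for ranked in recs.values():
        for pos, item in enumerate(ranked[:k]):
            total_exp += discounts[pos]
            if item not in head:
                tail_exp += discounts[pos]
    return tail_exp / total_exp if total_exp else 0.0
```

If the tail's exposure share stays far below its share of the catalog (or of actual interactions), the ranked lists are concentrating visibility on the already popular side.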

Designing a controlled comparative benchmark

To isolate how diffusion modeling relates to fairness, we evaluate diffusion recommenders alongside strong baselines that represent different modeling families, including traditional collaborative filtering, neural methods, graph-based models, and other generative approaches. The goal is not to establish a single best model but to situate diffusion recommenders within a landscape where utility and fairness patterns are already known to vary.

We evaluate all approaches on the same datasets and under the same recommendation setting (top-ranked lists), so that differences in observed fairness are attributable to ranking behavior rather than to mismatched evaluation contexts.

Contrasting diffusion design choices through DiffRec vs. L-DiffRec

A core methodological lever in the study is the inclusion of L-DiffRec as a diffusion variant that changes how the denoising process is applied. At a conceptual level, DiffRec treats recommendation as reconstructing a user’s interaction signal through iterative denoising, aiming to recover preference patterns while discounting noise in implicit feedback.
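As a rough intuition for this reconstruction view, the following sketch trains a small denoiser to recover a clean interaction vector from a noised copy at a randomly drawn diffusion step. The exponential noise schedule, network size, and step conditioning are simplifying assumptions for illustration, not DiffRec's actual design.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Small MLP that predicts the clean interaction vector from a noised one."""
    def __init__(self, n_items, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_items + 1, hidden), nn.Tanh(), nn.Linear(hidden, n_items))

    def forward(self, x_t, t_frac):
        # Condition on the normalized diffusion step alongside the noised input.
        return self.net(torch.cat([x_t, t_frac], dim=1))

def diffusion_step_loss(model, x0, T=100):
    """One training step: noise user vectors to a random step t, then denoise."""
    t = torch.randint(1, T + 1, (x0.size(0), 1)).float()
    alpha_bar = torch.exp(-5.0 * t / T)  # toy noise schedule (assumption)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * torch.randn_like(x0)
    return nn.functional.mse_loss(model(x_t, t / T), x0)

# Usage sketch: a batch of sparse binary user-item interaction rows.
model = Denoiser(n_items=1000)
x0 = (torch.rand(32, 1000) < 0.02).float()
diffusion_step_loss(model, x0).backward()
```

The point relevant to fairness is visible directly in the objective: the model is rewarded precisely for reproducing the observed interaction distribution, popularity skew included.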

L-DiffRec introduces an additional abstraction: it groups items into clusters and performs diffusion in a compressed latent representation tied to these clusters. This effectively changes the “resolution” at which the model learns the interaction distribution. From a fairness point of view, this is important because it can alter which signals are easiest to recover: models that more directly track dominant popularity patterns may reinforce exposure concentration, whereas models that operate on coarser or structured representations may be less sensitive to popularity-driven imbalances.
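The cluster-then-compress idea can be sketched as follows: items are grouped into clusters, each cluster's slice of the interaction vector is encoded into a small code, and diffusion then runs over the concatenated codes rather than the full interaction space. In the sketch, k-means and plain linear encoders stand in for L-DiffRec's richer components (which use variational encoders); the cluster count and latent sizes are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def cluster_items(item_embeddings, n_clusters=4):
    """Group items into clusters, here via k-means on pretrained item vectors."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(item_embeddings).labels_

class ClusteredEncoder(nn.Module):
    """Compress each cluster's slice of the interaction vector independently;
    the concatenated codes form the latent space the diffusion model sees."""
    def __init__(self, labels, latent_per_cluster=32):
        super().__init__()
        self.slices = [torch.where(torch.tensor(labels) == c)[0]
                       for c in sorted(set(labels))]
        self.encoders = nn.ModuleList(
            [nn.Linear(len(idx), latent_per_cluster) for idx in self.slices])

    def forward(self, x):
        codes = [enc(x[:, idx]) for enc, idx in zip(self.encoders, self.slices)]
        return torch.cat(codes, dim=1)

# Usage sketch: 1000 items with 16-dim pretrained embeddings, batch of 8 users.
labels = cluster_items(np.random.randn(1000, 16))
z = ClusteredEncoder(labels)(torch.rand(8, 1000))  # input to the diffusion step
```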

By comparing DiffRec and L-DiffRec under identical fairness measurements, we can interpret whether seemingly small architectural and representational choices in diffusion recommendation are associated with systematic changes in fairness outcomes.

Moving beyond single-metric reporting with a trade-off view

Because utility, consumer fairness, and provider fairness can conflict, we complement single-metric inspection with a trade-off analysis that places these dimensions on a shared footing. The key idea is not to collapse everything into one score, but to reveal whether a model’s strong utility is structurally tied to disparities, and whether some models occupy a more balanced region of the outcome space.

This multi-dimensional view helps distinguish two situations that can look similar under standard accuracy-only evaluation: (i) models that are strong because they are broadly beneficial across groups, and (ii) models that are strong because they optimize for dominant patterns while leaving minority groups or long-tail items behind.
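One simple way to realize this trade-off view is a Pareto-style analysis over per-model outcome tuples, keeping utility (higher is better) and the two fairness gaps (lower is better) as separate dimensions. The helper below is a generic sketch, and the input scores are made up purely to show the mechanics; they are not results from the study.

```python
def pareto_front(models):
    """Models not dominated on (utility high, consumer gap low, provider gap low).

    models: dict name -> (utility, consumer_gap, provider_gap)
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
        better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
        return no_worse and better

    return [name for name, s in models.items()
            if not any(dominates(o, s)
                       for n, o in models.items() if n != name)]

# Hypothetical scores: (NDCG@10, consumer utility gap, provider exposure gap).
scores = {"model_a": (0.110, 0.040, 0.30),
          "model_b": (0.105, 0.025, 0.18),
          "model_c": (0.070, 0.045, 0.35)}
print(pareto_front(scores))  # model_c is dominated; the other two trade off
```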

Findings and insights

Across datasets, diffusion-based recommendation shows a consistent theme: strong utility does not guarantee equitable outcomes, and diffusion learning can align closely with the biases already present in interaction logs. In particular, DiffRec tends to exhibit unfairness patterns that are compatible with a bias-amplification concern: when the model learns to recover the historical interaction distribution effectively, disparities in observed data can reappear as disparities in recommendations.

A notable insight is that fairness behavior is domain-sensitive. In the movie domain dataset, diffusion-based recommendation is among the least fair approaches from a consumer perspective, while some non-diffusion baselines show a more balanced profile. In the location check-in domain dataset, the picture changes: several non-diffusion approaches are comparatively fairer on the consumer side, and diffusion behaves differently depending on the variant. This suggests that the relationship between diffusion learning and consumer fairness is mediated by how preferences are expressed in the domain, how interaction noise manifests, and how group-level behavioral differences shape the observed data.

On the provider side, the contrast between DiffRec and L-DiffRec is particularly informative. L-DiffRec tends to produce fairer provider-side outcomes than DiffRec, indicating that diffusion recommendation is not inherently incompatible with fairness. Instead, representational choices (such as operating in a clustered latent space) appear to influence whether the model simply follows popularity signals or captures patterns that broaden exposure. This is an encouraging result because it shifts the discussion from “diffusion is unfair” to “diffusion requires careful adaptation to avoid reproducing structural skew.”

The trade-off analysis sharpens these conclusions. DiffRec often occupies a region where improvements in utility are accompanied by deteriorations in consumer and provider fairness, consistent with a model that is highly aligned with dominant historical patterns. L-DiffRec, in contrast, can move closer to a balanced region in which utility and fairness are less antagonistic, and in some settings it achieves a particularly favorable balance. The broader takeaway is that diffusion recommenders are not “plug-and-play” with respect to fairness: without explicit attention, they can prioritize the majority and the head of the catalog, even when standard accuracy metrics look strong.

Conclusions

This study positions diffusion-based recommendation within the broader fairness agenda by providing an outcome-based audit that jointly considers consumers and providers. The main conceptual contribution is the empirical evidence that diffusion recommenders can reproduce or exacerbate disparities already embedded in interaction data, alongside the complementary evidence that diffusion variants with structured representation choices can mitigate these effects.

Several research directions follow naturally from these findings. We can extend fairness auditing to a wider set of diffusion recommenders and recommendation tasks, including those where temporal dynamics are central, to test whether bias amplification changes under different feedback structures. We can also investigate where unfairness emerges inside the diffusion recommendation pipeline: whether disparities are primarily inherited from the training data, from how denoising prioritizes frequent patterns, or from how generated scores translate into exposure in top-ranked lists. Finally, we can move from auditing to design by integrating fairness constraints or fairness-aware guidance into diffusion-based learning and sampling, and by evaluating these approaches under richer sensitive attributes and multi-group settings to better reflect real deployment conditions.