Beyond-accuracy perspectives in recommender systems

How do users perceive recommender systems’ objectives?

In multi-objective recommender systems, system-side metrics often serve both as optimization targets and as labels for user-facing controls, but this practice can misalign with users’ conceptual understanding of objectives, which in turn undermines tuning effectiveness, transparency, and satisfaction. Empirically measuring perception can enable more interpretable objective controls and more defensible metric choices by explicitly linking objective operationalizations to what users actually experience.

Modern recommender systems are increasingly expected to balance multiple objectives rather than optimizing relevance alone. Common “beyond-accuracy” objectives (such as novelty, diversity, and exploration) are frequently treated as universal qualities that can be measured, optimized, and exposed to users through simple controls.

This assumption is fragile for two reasons. First, an objective label (e.g., “diversity” or “exploration”) does not guarantee a shared meaning between system designers and users. Second, even if the label is understood, there are many plausible ways to operationalize the objective, and these alternatives can produce noticeably different recommendation experiences. When objectives are embedded into user controls, any mismatch becomes a user experience issue: the system may appear opaque, the controls may feel ineffective, and users may infer incorrect trade-offs.

In a study conducted in cooperation with Patrik Dokoupil and Ladislav Peška, and published in the Proceedings of ACM RecSys ’25, we introduce a user-centered analysis of how people understand and perceive common recommender objectives, and of how those perceptions relate to widely used metric operationalizations.

The work addresses a gap that matters for both evaluation and interface design: objective metrics are routinely used as proxies for user experience, yet their alignment with user perception, especially across domains and across alternative operationalizations, is rarely examined in a unified way.

High-level solution overview

We study objective alignment through a controlled, interactive user study in two domains (books and movies). Participants receive recommendations in repeated iterations and evaluate what they experienced along multiple objective dimensions (including relevance and several beyond-accuracy and “inverse” notions such as popularity, uniformity, and exploitation).

At the same time, we compute corresponding system-side metrics on the recommendation lists, including multiple operationalization variants for several objectives. This lets us compare (i) how users conceptually interpret objective terms, and (ii) how strongly different metric variants track the perceptions users report after interacting with the system.
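
As a rough illustration of what computing such list-level metrics under multiple operationalization variants can look like, here is a minimal, hypothetical sketch (not the paper’s actual evaluation code); the popularity counts, genre vectors, and function names are illustrative assumptions.

```python
import numpy as np

# Toy data (illustrative only): interaction counts and a content-based item representation.
popularity = {"a": 900, "b": 50, "c": 300, "d": 10}
genre_vectors = {"a": np.array([1.0, 0.0, 0.0]), "b": np.array([0.0, 1.0, 0.0]),
                 "c": np.array([1.0, 1.0, 0.0]), "d": np.array([0.0, 0.0, 1.0])}

def novelty_popularity_complement(rec_items, popularity):
    """Mean popularity complement: higher when the list favors less popular items."""
    max_pop = max(popularity.values())
    return float(np.mean([1.0 - popularity[i] / max_pop for i in rec_items]))

def intra_list_diversity(rec_items, item_vectors):
    """Mean pairwise cosine distance between the items' vector representations."""
    vecs = np.stack([item_vectors[i] for i in rec_items])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    pairwise_sims = vecs @ vecs.T
    upper = np.triu_indices(len(rec_items), k=1)
    return float(np.mean(1.0 - pairwise_sims[upper]))

rec_list = ["a", "b", "d"]
print(novelty_popularity_complement(rec_list, popularity))
print(intra_list_diversity(rec_list, genre_vectors))
```

Swapping genre_vectors for a different item representation (say, collaborative-filtering embeddings) yields another operationalization variant of the same nominal objective, which is exactly the degree of freedom the study examines.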

Our approach

The figure captures the core methodological idea: we not only evaluate static recommendation lists but also observe perception and control in an iterative setting where users repeatedly see recommendations, act on them, and then reflect on objective qualities.

Separating “conceptual meaning” from “metric operationalization”

A central challenge is that disagreement can happen at two levels. Users may interpret an objective label differently from the system’s intended meaning (conceptual mismatch), and even when the concept is shared, a specific metric may still fail to capture what users perceive (operationalization mismatch). We explicitly measure both levels by collecting users’ interpretations of objective terms early in the study, and their perceived qualities later after interacting with recommendations.

This separation matters because it distinguishes two different fixes: improving explanations and onboarding to align meanings, versus improving the metrics (or their domain-specific instantiations) to better capture experience.

Testing multiple operationalizations rather than treating a metric as the objective

For several objectives, we treat operationalization as a design choice rather than a given. Many objectives depend on how item similarity or distance is defined; different representations can induce different notions of “difference,” “novelty,” or “departure from prior interests.” By comparing multiple variants, we can ask a practical question: which operationalization most consistently reflects what users report, and does the answer change by domain?

This matters because it reframes offline evaluation: if the “best-aligned” metric depends on domain and representation, then reporting a single conventional metric can create a misleading sense of progress.
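
To make this concrete, the following hypothetical sketch scores the same recommendation list for “departure from prior interests” under two different item representations; the vectors and the departure_from_profile function are invented for illustration and are not taken from the paper.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def departure_from_profile(rec_items, profile_items, item_vectors):
    """Mean cosine distance of recommended items from the centroid of the user's
    past items, computed under whichever item representation is passed in."""
    centroid = unit(np.mean([unit(item_vectors[i]) for i in profile_items], axis=0))
    return float(np.mean([1.0 - unit(item_vectors[i]) @ centroid for i in rec_items]))

# Two hypothetical representations of the same four items:
content_vecs = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]),
                "c": np.array([0.0, 1.0]), "d": np.array([0.1, 0.9])}
collab_vecs  = {"a": np.array([0.2, 0.8]), "b": np.array([0.7, 0.3]),
                "c": np.array([0.6, 0.4]), "d": np.array([0.3, 0.7])}

profile, recs = ["a", "b"], ["c", "d"]
print(departure_from_profile(recs, profile, content_vecs))  # large departure in content space
print(departure_from_profile(recs, profile, collab_vecs))   # small departure in collaborative space
```

The same list looks strongly exploratory under one representation and close to the user’s profile under the other, which is precisely why treating any single operationalization as “the” objective can be misleading.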

Embedding objectives into interaction and controllability

A further ingredient is that users experience objectives not as abstract scores but as repeated trade-offs. We therefore observe perception in a setting that includes both a standard recommendation condition and controllable multi-objective conditions where users can express propensities toward objectives via an interface.

This matters because controllability is exactly the scenario where misunderstanding becomes visible: if users do not share the system’s meaning of an objective, adjusting the control is unlikely to yield the intended experiential change.
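
As an illustration of what such controllability could look like under the hood, here is a hypothetical weighted-scalarization sketch; the paper’s interface and re-ranking logic are not reproduced here, and the objective names, per-item scores, and weights are assumptions.

```python
# Toy per-item objective scores, assumed to be normalized to [0, 1] (illustrative).
objective_scores = {
    "relevance": {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2},
    "novelty":   {"a": 0.1, "b": 0.3, "c": 0.8, "d": 0.9},
}

def rerank(candidates, objective_scores, weights, k=3):
    """Order candidates by a weighted sum of per-objective scores; the weights
    stand in for the propensities a user expresses through the interface."""
    def total(item):
        return sum(w * objective_scores[obj][item] for obj, w in weights.items())
    return sorted(candidates, key=total, reverse=True)[:k]

items = ["a", "b", "c", "d"]
print(rerank(items, objective_scores, {"relevance": 0.8, "novelty": 0.2}))  # relevance-leaning list
print(rerank(items, objective_scores, {"relevance": 0.3, "novelty": 0.7}))  # novelty-leaning list
```

List-dependent objectives such as diversity would need a marginal-gain (greedy re-ranking) formulation rather than fixed per-item scores, but even this simple scalarization shows how user-set weights reshape the list.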

Linking perception, behavior, and overall satisfaction

Finally, we do not treat perceived objectives as isolated. We relate perceived qualities to user actions (what users select) and to overall satisfaction, to understand how objective experiences combine into a holistic judgment.

This matters because it moves beyond “does the metric correlate with its label?” toward “which objective experiences actually appear to support satisfaction in this setting, and under what balance?”
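
One simple way to quantify such relationships (a sketch, not necessarily the paper’s exact analysis) is rank correlation between computed metric values, reported perceptions, and satisfaction ratings; the per-iteration numbers below are invented.

```python
from scipy.stats import spearmanr

# Hypothetical per-iteration records: a computed metric variant, the user-reported
# perception of the corresponding quality, and overall satisfaction (Likert-style).
metric_values     = [0.12, 0.35, 0.28, 0.50, 0.41, 0.22]
perceived_quality = [2, 4, 3, 5, 4, 2]
satisfaction      = [3, 4, 4, 3, 4, 3]

rho_mp, p_mp = spearmanr(metric_values, perceived_quality)
rho_ps, p_ps = spearmanr(perceived_quality, satisfaction)
print(f"metric vs. perception:       rho={rho_mp:.2f} (p={p_mp:.3f})")
print(f"perception vs. satisfaction: rho={rho_ps:.2f} (p={p_ps:.3f})")
```

In practice this kind of analysis would be repeated per objective, per operationalization variant, and per domain, which is what turns a single correlation number into an alignment picture.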

Findings and insights

Users often recognize objective terms, but their meanings are not stable across people, especially for beyond-accuracy objectives. Relevance tends to be interpreted more consistently, while concepts such as diversity, novelty, and exploration show substantial conceptual fragmentation. This immediately suggests that simply exposing these objectives as controls (or reporting them as evaluation dimensions) can create a false sense of shared understanding.

On the operationalization side, we find partial alignment: many objective metrics track perceived qualities to some degree, but the mapping is neither strong nor uniform. Different operationalizations can behave differently across domains, and some representation choices appear systematically less effective as proxies for perception. The broader implication is that “objective evaluation” is not only about choosing which objective to measure, but also about choosing how to instantiate it in the specific domain.

A particularly important insight is cross-objective overlap. Metrics intended for one beyond-accuracy objective can align similarly well with perceptions of other beyond-accuracy qualities. Conceptually, this indicates that common operationalizations entangle multiple experiential factors: what feels “novel” may also feel “diverse,” and what feels “exploratory” may co-vary with both. Practically, it cautions against interpreting metric improvements as objective-specific wins without additional evidence that users experience the intended change.

Serendipity stands out as difficult to capture. Users’ reported serendipity shows weak relationships with most measured variants, reinforcing the idea that serendipity is not easily reducible to standard list-level signals in this kind of interaction setting. This limits how confidently we can optimize for serendipity using conventional offline proxies.

Finally, the study highlights a tension that is easy to miss when beyond-accuracy objectives are treated as uniformly beneficial. In this setting, overall satisfaction tends to align more with signals associated with familiarity and proximity to known preferences (including inverse notions such as popularity, uniformity, and exploitation) than with aggressively maximizing novelty, diversity, or exploration. The takeaway is not that beyond-accuracy objectives are undesirable, but that their intensity and balance matter: users may appreciate them up to a point, after which they can degrade the experience.

Importantly, when users’ conceptual understanding matches the system’s intended meaning, the metric–perception alignment improves for several objectives and operationalizations. This suggests that part of the “metric gap” is not purely a measurement failure; it is also a communication and mental-model problem.

Conclusions

This work reframes multi-objective recommendation as a problem of alignment among three layers: objective labels, metric operationalizations, and user experience. The key contribution is evidence that both conceptual mismatch (what users think an objective means) and operationalization mismatch (what a metric captures) can be substantial, and that controllable interfaces amplify the consequences of misunderstanding.

Several research directions follow naturally. One is to design mechanisms that actively align meanings (through explanations, interactive onboarding, or interface feedback that clarifies trade-offs) and then measure whether alignment causally improves both user control and metric validity. A second is to develop domain-adaptive or user-adaptive operationalizations that better reflect how “difference” and “novelty” are perceived in specific contexts, rather than assuming a single representation is appropriate everywhere. A third is to treat cross-objective entanglement as a first-class phenomenon: instead of optimizing isolated proxies, we can model the latent experiential factors that metrics jointly capture, and expose controls that correspond to those factors more directly.

Overall, the paper argues for a more user-grounded view of objectives: if objective-driven recommender systems are meant to be transparent, controllable, and trustworthy, then validating what objectives mean to users (and how we measure them) becomes part of the core technical agenda rather than an afterthought.