Explainable recommendation methods based on path reasoning over knowledge graphs require an end-to-end workflow that connects graph preparation, model training, explanation generation, and explanation-aware evaluation. hopwise is an open-source Python library that extends the RecBole ecosystem with interoperable datasets, path-reasoning models, and explanation-oriented evaluation tools, making systematic benchmarking and reuse practical.
Context and motivation
Path-based explainability over knowledge graphs sits at an intersection of two demands that are often treated separately: strong recommendation performance and transparent, structured rationales for why an item is suggested.
However, turning this conceptual promise into cumulative research requires infrastructure. Without shared data preparation conventions, compatible model interfaces, and agreed-upon evaluation practices for explanation paths, results remain difficult to reproduce and hard to compare across papers. Beyond that, explainability introduces quality dimensions that do not reduce to accuracy, such as coverage of explainable items, diversity of reasoning patterns, and properties of the entities and interactions used to justify recommendations. These dimensions call for evaluation pipelines that treat explanation artifacts as first-class outputs rather than side products.
Paper positioning
In a study carried out in cooperation with Gianni Fenu, Mirko Marras, Giacomo Medda, and Alessandro Soccol, and published in the Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), we introduce hopwise, a resource aimed at reproducible research in explainable recommendation based on path reasoning over knowledge graphs.
The work targets a concrete gap: existing recommendation libraries provide strong support for general experimentation, but offer limited and often non-unified support for the combination of (i) knowledge-graph-based recommendation, (ii) path-based explanation generation, and (iii) explanation-quality evaluation under a standardized experimental lifecycle. Our positioning is that progress in this area depends as much on shared tooling and evaluation conventions as on new model architectures.
Resource overview
hopwise is a Python library for explainable recommendation, where explanations are k-hop paths connecting users to recommended items in a collaborative knowledge graph. We design it for researchers and research-aware practitioners who need to: (1) run comparable experiments across different path-reasoning paradigms, (2) extend baselines with minimal “glue code,” and (3) evaluate not only ranking utility but also the properties of the produced explanation paths.
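To make the notion of a k-hop explanation path concrete, the sketch below shows one way such a path could be represented as a sequence of typed triples over a collaborative knowledge graph. The entity and relation names are invented for illustration and are not taken from hopwise or its bundled datasets.

```python
# Illustrative only: a 3-hop explanation path in a collaborative knowledge
# graph, linking a user to a recommended item through shared entities.
# Entity and relation names are hypothetical, not taken from hopwise data.
path = [
    ("user:42",         "interacted_with", "movie:Inception"),
    ("movie:Inception", "directed_by",     "person:Nolan"),
    ("person:Nolan",    "directed",        "movie:Tenet"),
]

user, recommended_item = path[0][0], path[-1][2]
relation_pattern = tuple(rel for _, rel, _ in path)
print(f"Recommend {recommended_item} to {user} via pattern {relation_pattern}")
```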
At a high level, the library enables three capabilities that are rarely available together in a single workflow.
First, we provide end-to-end lifecycle support, spanning knowledge graph preparation, path sampling, model training, explanation delivery, and evaluation, while staying compatible with RecBole’s experimental conventions.
Second, we unify multiple families of path-reasoning recommender systems. The resource supports reinforcement-learning-style path reasoning and autoregressive “path language model” approaches, and it accommodates knowledge-graph embedding priors that can be refined through reasoning, enabling controlled comparisons across modeling choices.
Third, we treat explanation outputs as structured artifacts that can be evaluated systematically. The library integrates beyond-accuracy objectives and a dedicated suite of path-quality metrics, supporting analysis of when a method is accurate but produces low-coverage or low-diversity explanations, and when explanation behavior differs substantially despite similar recommendation utility.
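To give a feel for what a shared experimental lifecycle looks like in this ecosystem, here is a minimal configuration sketch expressed as a plain Python dictionary. The standard keys (training hyperparameters, eval_args, metrics, topk) follow common RecBole conventions; the explanation-related keys at the end are purely hypothetical placeholders and do not reflect hopwise's actual configuration schema.

```python
# Hypothetical configuration sketch in the spirit of RecBole config dicts.
# Standard keys follow RecBole conventions; the explanation-related keys
# are invented placeholders, NOT hopwise's real configuration schema.
config_dict = {
    # training
    "epochs": 50,
    "train_batch_size": 2048,
    "learning_rate": 1e-3,
    # evaluation protocol (RecBole-style)
    "eval_args": {
        "split": {"RS": [0.8, 0.1, 0.1]},  # random train/valid/test split
        "order": "RO",
        "group_by": "user",
        "mode": "full",
    },
    "metrics": ["Recall", "NDCG", "Hit"],
    "topk": [10],
    # hypothetical explanation-oriented options (illustration only)
    "max_path_hops": 3,        # length of reasoning paths
    "paths_per_user_item": 1,  # how many explanation paths to keep per pair
}
```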
How it was created

The figure conceptualizes the hopwise pipeline, where new datasets, models, utilities, and metrics plug into a shared configuration, training, and evaluation flow.
Compatibility-first extension over RecBole (adoption and reuse constraint).
A recurring failure mode in research tooling is that a new framework reinvents the full training/evaluation stack, increasing maintenance burden and making it harder for the community to adopt. We address this by extending a widely used ecosystem (RecBole) rather than replacing it. This design choice matters because it preserves familiar abstractions (configuration management, dataset handling, checkpointing, and evaluation orchestration) while allowing explainability-specific features to be added without breaking established workflows.
A unified graph-and-path view that supports heterogeneous reasoning paradigms (interoperability constraint).
Path reasoning methods differ substantially in how they consume data: reinforcement-learning approaches navigate graph neighborhoods, while language-model approaches may require textualized and tokenized paths. We reconcile this heterogeneity by grounding methods in a common collaborative knowledge graph representation and then introducing specialized dataset/dataloader logic when a paradigm demands it (notably for path language models). This separation lets downstream users compare methods under aligned data splits and experimental protocols, while still respecting the distinct data requirements of each modeling family.
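To illustrate why path language models need their own data preparation, the following sketch (not hopwise code) shows one plausible way to linearize a knowledge-graph path into a token sequence that an autoregressive model could consume. The special tokens and formatting scheme are assumptions made for illustration; a real pipeline would also build a vocabulary and handle padding and masking.

```python
# Illustrative sketch of path linearization for a "path language model".
# The special tokens and formatting are assumptions, not hopwise's scheme.
def linearize_path(path, bos="[BOS]", eos="[EOS]"):
    """Turn a list of (head, relation, tail) triples into a token sequence."""
    tokens = [bos, path[0][0]]           # start with the user entity
    for head, relation, tail in path:
        tokens.extend([relation, tail])  # alternate relation / entity tokens
    tokens.append(eos)
    return tokens

path = [
    ("user:42",         "interacted_with", "movie:Inception"),
    ("movie:Inception", "directed_by",     "person:Nolan"),
    ("person:Nolan",    "directed",        "movie:Tenet"),
]
print(linearize_path(path))
# ['[BOS]', 'user:42', 'interacted_with', 'movie:Inception',
#  'directed_by', 'person:Nolan', 'directed', 'movie:Tenet', '[EOS]']
```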
Explainability as a standardized output contract (extensibility and comparability constraint).
When explanations are produced ad hoc, evaluation and visualization become bespoke and non-transferable across models. We introduce a standardized notion of “explanation delivery”, where models expose paths in a consistent structure (conceptually: a scored user–item pair accompanied by a path). This matters because it decouples explanation generation from evaluation: new models can be integrated as long as they produce paths in the expected form, and the same analysis tools and metrics can be reused across approaches.
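As a sketch of what such an output contract could look like in practice, a minimal structure only needs the scored user-item pair and the supporting path. This is a conceptual illustration under our own naming, not hopwise's actual interface.

```python
from dataclasses import dataclass

# Hypothetical sketch of an "explanation delivery" record: a scored
# user-item pair together with the path that justifies it. Conceptual
# illustration only, not hopwise's actual output interface.
@dataclass
class PathExplanation:
    user_id: str
    item_id: str
    score: float                      # model's recommendation score
    path: list[tuple[str, str, str]]  # (head, relation, tail) triples

explanation = PathExplanation(
    user_id="user:42",
    item_id="movie:Tenet",
    score=0.87,
    path=[
        ("user:42",         "interacted_with", "movie:Inception"),
        ("movie:Inception", "directed_by",     "person:Nolan"),
        ("person:Nolan",    "directed",        "movie:Tenet"),
    ],
)
```

Any evaluation or visualization tool that accepts this shape can then be reused across models, which is the point of the contract.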
Evaluation designed around beyond-utility goals and path quality (evaluation-gap constraint).
Explainable recommendation introduces goals that are not captured by accuracy alone, and explanation paths require their own quality lenses. We extend the evaluation pipeline with metrics for beyond-utility properties and multiple dimensions of explanation-path quality, including coverage/fidelity of explainability, diversity of explanation patterns, and characteristics of the interactions and entities used in the paths. This design matters because it makes it feasible to run unified benchmarks that expose trade-offs: a method may recommend well but repeatedly reuse the same reasoning patterns, or it may generate diverse paths but fail to explain a substantial fraction of recommendations.
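To make these quality lenses concrete, the sketch below computes two simple path-centric measures over a set of delivered explanations: coverage (the fraction of recommended user-item pairs that received at least one explanation path) and pattern diversity (the fraction of distinct relation sequences among delivered paths). These are simplified, assumed formulations; the metric definitions implemented in hopwise may differ.

```python
# Simplified, assumed formulations of two path-quality measures; hopwise's
# actual metric definitions may differ.
def explanation_coverage(recommended_pairs, explained_pairs):
    """Fraction of recommended (user, item) pairs with at least one path."""
    recommended = set(recommended_pairs)
    if not recommended:
        return 0.0
    return len(recommended & set(explained_pairs)) / len(recommended)

def pattern_diversity(paths):
    """Fraction of distinct relation-type sequences among delivered paths."""
    if not paths:
        return 0.0
    patterns = {tuple(rel for _, rel, _ in path) for path in paths}
    return len(patterns) / len(paths)
```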
Evaluation, use cases, and lessons learned
We validate hopwise by using it as a benchmarking environment over two knowledge graphs and multiple representative explainable path-reasoning methods. The key contribution of this evidence is not a single “best model” outcome, but the fact that the same pipeline can jointly reveal differences across recommendation utility, beyond-utility objectives, and explanation-path behavior.
Several lessons emerge from the benchmark. First, methods can look similar under standard ranking metrics while diverging meaningfully in whether they can reliably produce explanation paths for recommended items; in the reported experiments, path language model approaches exhibit lower explanation fidelity than reinforcement-learning-based methods, highlighting a practical robustness issue when explanation generation is part of the required output. Second, modeling choices that are often treated as “implementation details” (for example, the specific embedding priors used in some reasoning methods) can be evaluated under controlled conditions, and the benchmark illustrates cases where such choices have limited impact relative to broader modeling assumptions. Third, the unified evaluation emphasizes that improving one dimension (such as coverage or novelty) can coincide with degradation in another (such as repetitiveness or limited diversity of explanation types), reinforcing the need to treat explainability as a multi-objective problem rather than a single-score add-on.
Limitations and scope
Our resource is deliberately scoped to path-based explainability over knowledge graphs. It does not aim to be a general framework for all explainability paradigms in recommendation (for example, feature attribution or free-form natural language justifications are not the organizing principle of the library).
The evaluation layer is strongest when explanation outputs are well-formed paths; therefore, conclusions drawn from the explanation-quality metrics should be interpreted as path-centric. Additionally, the practical utility of fairness- or group-disparity-style metrics depends on whether a dataset provides the necessary user-group signals, which may not always be available or reliable. Finally, while the library supports multiple datasets and models, the breadth of coverage remains bounded by the included knowledge graphs and the effort required to maintain preprocessing compatibility as upstream datasets evolve.
Access and reuse
Code and data artifacts are available via a public GitHub repository (https://github.com/tail-unica/hopwise).
Conclusions
hopwise treats explainable path reasoning as an end-to-end research workflow: not just training a recommender system, but also producing structured explanation paths and evaluating them with dedicated, comparable criteria. This shifts path-based explainable recommendation from a collection of isolated implementations into a setting where results can be reproduced, ablations can be meaningfully compared, and new methods can be integrated with less overhead.
The paper outlines concrete extensions that align with this goal, including expanding the set of supported models (e.g., additional knowledge-aware recommenders) and contributing curated knowledge graphs to broader dataset repositories used by the community. More broadly, the resource creates room for community-driven growth: broader domain coverage, richer connectors to external toolchains, and stronger guarantees around explanation reliability, each of which becomes more feasible when built on a shared, standardized experimental substrate rather than one-off codebases.