Sustainability-aware food recommender systems require environmental impact signals at the ingredient and recipe level to reason about the consequences of personalized meal suggestions. Yet large-scale food recommendation corpora rarely provide carbon and water footprint estimates that can be robustly aligned with recipe ingredients, limiting comparability and reproducibility. GreenFoodLens is a dataset resource that enriches the HUMMUS user–recipe interaction network with environmental impact estimates derived from the hierarchical taxonomy of the SU-EATABLE-LIFE project, enabling sustainability-centered benchmarking and analysis at scale.
Food recommendation is increasingly expected to optimize beyond immediate user utility. In this domain, “beyond-accuracy” is not only about diversity or novelty: it is also about what a system encourages people to consume, and how those choices translate into environmental impact.
A recurring bottleneck is measurement infrastructure. If footprints exist only at a coarse commodity level, or if labels cannot be consistently mapped to real-world ingredient strings, then it becomes difficult to evaluate footprint-aware objectives, compare methods under shared assumptions, or study how recommendation feedback loops interact with sustainability.
In a study carried out in cooperation with Giacomo Balloccu, Gianni Fenu, Mirko Marras, Giacomo Medda, and Giovanni Murgia, and published in the Proceedings of ACM RecSys 2025, we introduce GreenFoodLens. GreenFoodLens is a dataset enrichment that augments a large user–recipe interaction network with sustainability information. Conceptually, it couples three layers that are typically studied in isolation: users’ historical recipe choices, recipes’ ingredient compositions, and a hierarchical sustainability taxonomy that associates food commodities with carbon and water footprint estimates.
Thanks to this alignment, we can measure the footprint distribution of recommendations, analyze how model families differ in their sustainability profiles, and design objectives or constraints that explicitly trade off preference satisfaction with environmental impact. The hierarchical structure is particularly useful because it supports reasoning at multiple granularities: when a fine-grained match is uncertain, the resource can still support stable analyses at higher levels of the taxonomy.
Resource description

Resource construction logic: we start from audited human labeling and quality checks, then scale to broader coverage via LLM-driven labeling that first elicits a concise ingredient “bootstrap” description and finally assigns a taxonomy path through constrained generation.
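To make the staged flow concrete, here is a minimal sketch (with purely illustrative names such as `describe` and `assign_path` standing in for the LLM calls; this is not the released code) of how an ingredient string could be routed through the audited human labels first, falling back to the two-step LLM labeler only when no curated label exists.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaxonomyLabel:
    path: list[str]   # e.g. ["plant-based", "legumes", "chickpeas"]
    source: str       # "human" or "llm"

def label_ingredient(
    ingredient: str,
    human_labels: dict[str, list[str]],
    describe: Callable[[str], str],                 # hypothetical LLM call, step 1
    assign_path: Callable[[str, str], list[str]],   # hypothetical LLM call, step 2
) -> TaxonomyLabel:
    """Prefer audited human labels; otherwise fall back to the two-step LLM labeler."""
    key = ingredient.strip().lower()
    if key in human_labels:
        return TaxonomyLabel(path=human_labels[key], source="human")
    description = describe(ingredient)              # step 1: short "bootstrap" description
    path = assign_path(ingredient, description)     # step 2: constrained taxonomy path
    return TaxonomyLabel(path=path, source="llm")
```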
A hybrid human–LLM pipeline to balance quality and coverage
A central failure mode in large-scale labeling is the tension between reliability and scalability. Purely manual annotation does not scale to the long tail of ingredient variants, while fully automated labeling can drift, hallucinate, or assign invalid categories without oversight.
We therefore combine audited human annotation (to establish correctness anchors and identify systematic ambiguities) with automated labeling (to expand coverage). This hybrid design matters downstream because it supports both trustworthy evaluation subsets and broad, dataset-wide analyses, rather than forcing researchers to choose one or the other.
Hierarchy-aware labeling that embraces ambiguity instead of forcing false precision
Ingredient strings are often underspecified: a single token may map to multiple plausible commodities, processing forms, or cultural variants. Forcing leaf-level precision can produce brittle labels that look specific but are wrong, and such errors propagate directly into footprint estimates.
We address this by leveraging a hierarchical taxonomy and designing the labeling process to allow meaningful stopping points at higher levels when specificity is not justified. This principle improves robustness for downstream users: analyses can be conducted at a granularity aligned with label confidence, and footprint reasoning can remain interpretable even when the ingredient evidence is incomplete.
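As a sketch of what hierarchy-aware fallback can look like in practice, the toy lookup below walks up the taxonomy until it reaches a node that carries footprint estimates; the node names, parent map, and numbers are placeholders, not SU-EATABLE-LIFE values.

```python
from typing import Optional

# Toy taxonomy: each node maps to (carbon, water) estimates or None, plus its parent.
# All values are placeholders for illustration only.
FOOTPRINTS: dict[str, Optional[tuple[float, float]]] = {
    "plant-based": (1.0, 300.0),
    "plant-based/legumes": (0.9, 400.0),
    "plant-based/legumes/chickpeas": None,   # no leaf-level estimate available
}
PARENT = {
    "plant-based/legumes/chickpeas": "plant-based/legumes",
    "plant-based/legumes": "plant-based",
}

def footprint_with_fallback(node: str) -> Optional[tuple[float, float]]:
    """Walk up the taxonomy until a node with footprint estimates is found."""
    current: Optional[str] = node
    while current is not None:
        values = FOOTPRINTS.get(current)
        if values is not None:
            return values
        current = PARENT.get(current)
    return None

print(footprint_with_fallback("plant-based/legumes/chickpeas"))  # falls back to legumes
```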
Constrained generation to keep automated labels structurally valid
Applying LLMs to taxonomy assignment introduces a distinct failure mode: even when the model “knows” what an ingredient is, it may output labels that do not correspond to a valid path in the taxonomy, or produce formatting and consistency errors that are costly to clean up.
We mitigate this by separating two conceptual steps. First, we prompt the model to generate a brief, task-relevant description of the ingredient to stabilize the decision context. Second, we enforce that the final label is produced through constrained generation so that the model’s output remains a valid taxonomy traversal. This choice matters because it shifts automated labeling from open-ended text generation to schema-respecting annotation, which is essential for a resource intended for benchmarking and reuse.
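Conceptually, the constraint amounts to never letting the model emit a label outside the taxonomy. The sketch below illustrates this as a level-by-level traversal in which a hypothetical `score` callback stands in for the LLM's preference among valid children; the released pipeline may enforce the constraint differently (for example at the decoding level), so treat this as an illustration of the principle rather than the actual mechanism.

```python
from typing import Callable

# Toy children map; the real hierarchy comes from the SU-EATABLE-LIFE taxonomy.
CHILDREN: dict[str, list[str]] = {
    "root": ["plant-based", "animal-based"],
    "plant-based": ["vegetables", "legumes", "grains"],
    "legumes": ["chickpeas", "lentils", "beans"],
}

def constrained_label(
    ingredient: str,
    description: str,
    score: Callable[[str, str, str], float],  # hypothetical scorer: (ingredient, description, candidate) -> score
    allow_stop_score: float = 0.0,
) -> list[str]:
    """Descend the taxonomy, always choosing among *valid* children.

    Stopping early at an internal node is allowed when no child clearly beats
    the 'stop here' option, which avoids forcing false precision.
    """
    path, node = [], "root"
    while node in CHILDREN:
        candidates = CHILDREN[node]
        best = max(candidates, key=lambda c: score(ingredient, description, c))
        if score(ingredient, description, best) <= allow_stop_score:
            break  # not confident enough to go deeper; keep a coarser label
        path.append(best)
        node = best
    return path
```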
Evaluation, use cases, and lessons learned
The paper validates GreenFoodLens through two complementary lenses: label quality and the behavior of recommender baselines under sustainability analysis.
On label quality, we assess automated ingredient labeling against a human-labeled reference subset. The resulting picture is practically important: broad-category assignments are comparatively stable, while finer-grained distinctions are more error-prone. This directly informs how the resource should be used—coarse-grained sustainability analyses can be robust, while fine-grained studies benefit from uncertainty awareness and, where possible, additional context.
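One simple way to quantify this depth-dependent reliability, assuming labels are stored as taxonomy paths, is to measure agreement with the human-labeled reference subset separately at each depth; the helper below is an illustrative sketch, not the paper's evaluation script.

```python
from collections import defaultdict

def agreement_by_depth(
    auto_labels: dict[str, list[str]],
    human_labels: dict[str, list[str]],
) -> dict[int, float]:
    """Fraction of ingredients whose automated path matches the human path
    up to (and including) each depth, over the human-labeled reference subset."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ingredient, gold in human_labels.items():
        pred = auto_labels.get(ingredient, [])
        for depth in range(1, len(gold) + 1):
            totals[depth] += 1
            if pred[:depth] == gold[:depth]:
                hits[depth] += 1
    return {d: hits[d] / totals[d] for d in sorted(totals)}
```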
On recommendation behavior, we run standard model families and examine how sustainability profiles interact with conventional utility and beyond-accuracy characteristics. A key lesson is that sustainability outcomes are not a free byproduct of better recommendation accuracy. Recommendations tend to reflect the interaction distribution, and popularity-driven signals can dominate, meaning that, in the absence of explicit sustainability constraints, models may only partially deviate from “most popular” behavior. The paper explicitly highlights the risk that, if historical interactions were skewed toward high-impact recipes, standard training would likely reinforce a corresponding “unsustainable” exposure pattern.
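A lightweight way to surface this effect, assuming recipe-level footprint estimates are available, is to compare the average footprint exposed by a model's top-k lists against that of a most-popular baseline; the sketch below is illustrative only and not the paper's analysis code.

```python
from statistics import mean

def mean_recommended_footprint(
    top_k_per_user: dict[str, list[str]],   # user id -> recommended recipe ids
    recipe_carbon: dict[str, float],        # recipe id -> carbon footprint estimate
) -> float:
    """Average carbon footprint a model's top-k lists expose users to."""
    per_user = []
    for recs in top_k_per_user.values():
        covered = [recipe_carbon[r] for r in recs if r in recipe_carbon]
        if covered:
            per_user.append(mean(covered))
    return mean(per_user)

# Computing this for a trained model and for a most-popular baseline makes the
# "recommendations mirror the interaction distribution" effect directly visible.
```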
Limitations and scope
GreenFoodLens focuses on carbon and water footprints derived from an external sustainability taxonomy and does not attempt to cover every environmental dimension. Recipe-level footprints are obtained by aggregating ingredient scores; because ingredient quantity information can be heterogeneous or unreliable, the aggregation does not fully model serving-size or quantity-aware accounting.
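For intuition, a quantity-agnostic aggregation could look like the sketch below, which averages the footprints of covered ingredients and abstains when coverage is too sparse; this is an illustrative simplification under that assumption, not necessarily the exact aggregation rule used in the resource.

```python
from statistics import mean
from typing import Optional

def recipe_footprint(
    ingredient_footprints: list[Optional[float]],
    min_coverage: float = 0.5,
) -> Optional[float]:
    """Unweighted mean over ingredients with known footprints.

    Returns None when too few ingredients are covered, since quantity-aware
    weighting is unavailable and sparse coverage would be misleading.
    """
    known = [f for f in ingredient_footprints if f is not None]
    if not ingredient_footprints or len(known) / len(ingredient_footprints) < min_coverage:
        return None
    return mean(known)
```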
On the labeling side, constrained generation keeps taxonomy assignments structurally valid, but the intermediate “bootstrap” descriptions remain dependent on the model’s prior knowledge and can be inconsistent for rare or ambiguous ingredients. More broadly, labeling an ingredient string without additional context can be intrinsically ambiguous, and some errors reflect this underdetermination rather than a single correctable bug.
Access and reuse
The paper states that the resource is available at the project repository: https://github.com/tail-unica/GreenFoodLens.
Conclusions
By coupling a large interaction corpus with hierarchy-aware carbon and water footprint estimates, GreenFoodLens enables reproducible evaluation of sustainability-aware objectives and clearer analysis of how recommender behavior interacts with environmental impact.
The most direct extensions follow from the resource’s current boundaries: strengthening context-aware labeling for ambiguous ingredients, adding safeguards that also constrain intermediate knowledge generation (not only the final taxonomy path), and improving recipe-level impact aggregation when quantity information can be normalized. These directions preserve the resource’s core intent: enabling sustainability-aware recommendation research that is measurable, comparable, and extensible.