Privacy-preserving & machine-learned catchment models for national dietary surveillance via digital footprint data

Horizon’s Transitional Assistant Professor, Georgiana Nica-Avram, has co-authored a paper, ‘Privacy-preserving & machine-learned catchment models for national dietary surveillance via digital footprint data’, which was submitted to the 2022 IEEE International Conference on Big Data.

Abstract

Big data from food retail stores is increasingly being used for population dietary surveillance, epidemiological studies of diet-related diseases, and evaluations of public health interventions. However, for retail data to be useful, it is necessary to understand the spatio-temporal variation of when and where food is purchased and consumed. While some customers willingly share home location data with retailers as part of loyalty programs, such data is typically too fine-grained or sensitive to be applied for research purposes. The aim of this study was to analyse differences between privacy-preserving models and actual retail catchments, and to investigate whether machine learning techniques could improve the accuracy of such catchment models. Based on a UK-wide sample of 4 million grocery store loyalty card holders, covering 485 million transactions over 29 months (2019-2021) and distributed across 33,000 neighbourhoods (Lower Super Output Areas, or LSOAs), the study demonstrates how models trained on geolocated data perform at predicting, per store, the catchment areas which contain 50%, 80%, and 95% of its customers’ primary locations. Through comparative assessment of machine learning approaches, we find better performance from tree-based models (RF, XGB), with the best performance from an XGB model achieving an R² of 0.72 and an MAE of 1.06. To conclude, we review variable importance measures using SHAP values and discuss the relative merits of including specific features when modeling catchment areas. © 2022 IEEE.
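
To make the idea of a catchment containing 50%, 80%, and 95% of a store's customers concrete, here is a minimal sketch in Python. It is an illustration only and not the paper's method: it assumes a catchment can be summarised as a distance from the store, and the function name `catchment_distance` and the sample distances are hypothetical.

```python
def catchment_distance(distances_km, share_pct):
    """Return the smallest distance that contains at least share_pct percent
    of the given customers.

    Assumption for illustration: each customer is represented by a single
    distance (in km) from their primary location to the store.
    """
    if not distances_km:
        raise ValueError("at least one customer distance is required")
    if not 0 < share_pct <= 100:
        raise ValueError("share_pct must be in the range (0, 100]")

    ordered = sorted(distances_km)
    # Number of customers needed to reach the requested share, rounded up
    # using integer arithmetic to avoid floating-point surprises.
    needed = (share_pct * len(ordered) + 99) // 100
    return ordered[needed - 1]


# Hypothetical distances (km) for ten customers of one store.
sample = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0, 20.0]

for pct in (50, 80, 95):
    print(pct, catchment_distance(sample, pct))
```

In this toy sample the distance needed to cover 95% of customers is much larger than the distance covering 50%, which shows how a few distant customers can stretch the outer catchment.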