This dramatically increased the cost and time of clinical studies.


To compare predicted scores with the ground truth data, F1 scores were calculated. The F1 score, the harmonic mean of precision and recall, was calculated using the set of tokens from the ASA24 description and the predicted set based on each food photo. True positives were tokens correctly predicted by the algorithm, false negatives were tokens present in the ASA24 description but not predicted by the algorithm, and false positives were tokens predicted by the algorithm but absent from the ASA24 description. The Kruskal–Wallis rank sum test with Dunn’s post hoc test was used to evaluate differences in median F1 scores, followed by Benjamini–Hochberg multiple testing correction. Associations between nutrients estimated from ASA24 and Bitesnap were tested using multiple regression and Spearman’s rank partial correlations, corrected for age, BMI, ethnicity, and education level. Age and BMI were treated as continuous variables, whereas ethnicity and education level were treated as categorical variables: ethnicity was dichotomized as White or Non-White given the imbalanced representation of Non-Caucasians in this study, and education level was a four-level factor (high school graduate, bachelor’s degree, some college or associate degree, and professional degree). These covariates were selected because nutrient intakes co-vary with these demographic variables. Linear models were created using the stats package, or the RVAideMemoire package version 0.9-81-2 in the case of Spearman’s rank partial correlations, with R version 4.1.0.
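The token-level F1 computation described above can be sketched as follows; this is a minimal illustration, and the function name and example tokens are hypothetical rather than taken from the study:

```python
def token_f1(truth_tokens, predicted_tokens):
    """Token-level F1: harmonic mean of precision and recall over two token sets."""
    truth, pred = set(truth_tokens), set(predicted_tokens)
    tp = len(truth & pred)  # tokens correctly predicted by the algorithm
    fp = len(pred - truth)  # predicted by the algorithm but absent from the ASA24 description
    fn = len(truth - pred)  # present in the ASA24 description but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: ASA24 description tokens vs. algorithm prediction.
truth = ["chicken", "rice", "broccoli"]
pred = ["chicken", "rice", "carrot"]
print(round(token_f1(truth, pred), 3))  # 2 TP, 1 FP, 1 FN -> 0.667
```

Because both sets are treated symmetrically apart from the TP/FP/FN labeling, the score penalizes missing ingredients and spurious predictions equally.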

Bootstrapped data from the correlation matrix using 1000 iterations were used to construct 95% confidence intervals. Normality was assessed using the Shapiro–Wilk test on model residuals and by inspecting deviations in quantile–quantile plots of the residuals. For non-normally distributed data, transformations were used to approximate normal distributions; otherwise, Spearman’s rank correlations were used. The textual representations of food descriptions for each meal image were parsed prior to calculating F1 scores for tokens between SNAPMe and the Im2Recipe or Facebook Inverse Cooking datasets. Joined food descriptions for each meal image were cleaned by removing punctuation, lemmatizing each word into its base form, and removing “stop words”, i.e., words that have no relevance to the text-matching task. A token was considered a stop word and removed if it did not provide information that could be used to identify a specific food. This strategy helped reconcile the different syntaxes of the USDA Food and Nutrient Database for Dietary Studies 17-18 and Recipe1M, the respective databases linked to SNAPMe and the publicly available algorithms, prior to calculating the accuracy of matching food descriptions. Additionally, manual entries were created for ASA24 food descriptions that were not fully resolved into ingredients. The FB Inverse Cooking and Im2Recipe algorithms were selected for testing because they are open source and well documented. The 1477 “before” food photos were processed using the Inverse Cooking and Im2Recipe algorithms. Of the 1477 photos, 14 could not be processed without error: 13 photos with a dark background and 1 photo with non-food items in the image.
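The cleaning steps above (punctuation removal, lemmatization, stop-word removal) can be sketched as below. The stop-word list and suffix-stripping lemmatizer are toy stand-ins for illustration only, since the study's exact tooling is not described here; a real pipeline would typically use, e.g., NLTK's WordNetLemmatizer and a curated stop-word list:

```python
import string

# Illustrative stop words: tokens that do not help identify a specific food.
STOP_WORDS = {"with", "and", "the", "a", "of", "in", "on", "fresh", "large"}

def lemmatize(token):
    """Toy lemmatizer: strip a plural 's'/'es' suffix (stand-in for a real lemmatizer)."""
    if token.endswith("es") and len(token) > 3:
        return token[:-2]
    if token.endswith("s") and len(token) > 2:
        return token[:-1]
    return token

def clean_description(description):
    """Lowercase, strip punctuation, drop stop words, and lemmatize the remaining tokens."""
    no_punct = description.lower().translate(str.maketrans("", "", string.punctuation))
    return {lemmatize(tok) for tok in no_punct.split() if tok not in STOP_WORDS}

print(sorted(clean_description("Sliced tomatoes, with fresh basil")))  # -> ['basil', 'sliced', 'tomato']
```

The cleaned sets from the ASA24 description and the algorithm's prediction can then be compared directly for token-level F1 scoring.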

For the 1463 photos that could be processed, the predicted ingredients were compared with the food items in the ASA24 linkage file. For multi-ingredient foods, this proved difficult because the ingredient prediction algorithm appropriately predicted ingredients, whereas the ASA24 file contained food items that may or may not have an available recipe from which ingredients can be derived. We therefore further manually “ingredientized” the ASA24 food descriptions when the food item was a mixed dish. Ingredients predicted by FB Inverse Cooking from ASA24 food records were evaluated using F1 scores. FB Inverse Cooking predictions for individual foods are provided in Supplemental File S2. The mean F1 score for all predictions was 0.23. Using the number of food codes from ASA24 records as a proxy for the number of ingredients in an image, the mean F1 score increased successively for images with 2–3 food codes and those with 4 or more food codes compared to single-food-code images. The Im2Recipe algorithm had an overall lower performance, with a mean F1 score of 0.13. Similar to the FB Inverse Cooking algorithm, Im2Recipe was better at predicting multi-ingredient foods, with mean F1 scores of 0.15 and 0.18 for 2–3 and 4+ food codes, respectively, compared to images corresponding to a single food code.

Participants were asked to reflect on their experiences with a post-study questionnaire. Participants indicated a preference for reporting their diet using Bitesnap compared to ASA24. When asked to rate their experience with ASA24 and Bitesnap, 90% and 85% of participants, respectively, agreed that ASA24 and Bitesnap accurately captured their diet. Many participants agreed that both ASA24 and Bitesnap were easy to use and not burdensome. When logging their meals with Bitesnap, 53% of participants reported doing so right after eating so they could remember what they had just eaten, and 32% reported doing so at the end of the day.
For ASA24, which was used in “food record” mode, 73% of participants reported their food intake at the end of the day.
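The stratification of F1 scores by food-code count, used above as a proxy for the number of ingredients in an image, can be sketched as follows; the per-image values here are made up for illustration and are not the study's data:

```python
from statistics import mean

# Hypothetical per-image results: (number of ASA24 food codes, F1 score).
results = [(1, 0.10), (1, 0.15), (2, 0.25), (3, 0.30), (4, 0.35), (5, 0.28)]

def bin_label(n_codes):
    """Bin images by food-code count, mirroring the 1 / 2-3 / 4+ comparison above."""
    if n_codes == 1:
        return "1"
    if n_codes <= 3:
        return "2-3"
    return "4+"

groups = {}
for n_codes, f1 in results:
    groups.setdefault(bin_label(n_codes), []).append(f1)

for label in ("1", "2-3", "4+"):
    print(label, round(mean(groups[label]), 3))
```

Group medians computed this way would be the inputs to the Kruskal–Wallis and Dunn's post hoc comparisons described in the methods.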

On average, however, 77% of participants answered that it took them less time to log meals into Bitesnap than ASA24, 18% answered that it took the same amount of time as ASA24, and 5% answered that it took a little more time than ASA24. For ASA24 and Bitesnap, 27% and 23% of participants, respectively, disliked the lack of multicultural food choices, including Vietnamese, Taiwanese, Italian, Japanese, and Korean food. With the ASA24 system, participants reported more difficulty reporting homemade food. For complex recipes, both platforms were difficult because the participant needed to provide a more detailed description of the meal so that the system could recognize and record what they ate.

In this study, we developed SNAPMe DB, the first publicly available food photo benchmark database collected in the context of dietary assessment by U.S. adults, containing over 3000 food photos linked with ASA24 food records. Previous experience with image-based approaches in clinical studies required the use of trained analysts to refine the annotations after the images were processed. A goal for the future would be to maintain data quality while minimizing input from trained analysts. The SNAPMe benchmark can be used to advance this goal by providing a means to compare different methods on the same dataset. SNAPMe DB can be used to evaluate existing models for the prediction of foods, ingredients, and/or nutrients, as captured in the context of “real world” dietary assessment. As part of our strategy to provide quality data in SNAPMe DB, we screened participants for their ability to judge ingredients and portion sizes. Of the 279 participants screened, 83 did not pass the screener for the estimation of foods and ingredients, even though our enrollment pool was enriched with students in university nutrition departments.
This suggests that many participants in the general population will not be able to record food very well and that the use of food photos may improve the dietary assessment of these individuals. To demonstrate the utility of the SNAPMe DB, we evaluated the ability of publicly available algorithms to predict food ingredients. The Facebook Inverse Cooking and Im2Recipe algorithms resulted in low accuracy. We speculate that the primary reason for this poor performance is that these models were trained on recipes, such as the Recipe1M database, and not on core or single-ingredient foods or beverages. This observation is supported by a lower mean F1 score for single-FoodCode images compared to images with multiple FoodCodes, which would typically represent a mixed meal or recipe. Additionally, images used for training these models were sourced from internet websites, such as cooking and recipe websites, which typically portray the finished dish and may differ in appearance from a plated dish combined with other foods not part of that recipe. This domain gap will need to be addressed in the development of future models. Differences in syntax between ASA24 food recall output and ingredient prediction algorithms also contributed to lower scores despite parsing and cleaning text descriptions. Methods to standardize text descriptions for food and nutrient databases are needed to harmonize syntactic differences and increase interoperability. We also evaluated predictions of nutrient estimates using Bitesnap compared with those estimated from ASA24 recalls. Nutrient estimates from Bitesnap modestly correlated with food records, with the lowest performance for food folate and the highest for cholesterol. Because cholesterol is derived from animal products, predicting cholesterol content from food photos is simplified by the limited number of foods that contain cholesterol.

Moreover, animal products such as meat are more often consumed in mixed meals rather than as single-ingredient foods, facilitating better predictions from the training data. Similarly, alcohol and caffeine, which occur in a relatively small subset of foods, were among the top predicted nutrients. In contrast, folate from food and calcium both had low predictive values, which may be a result of their ubiquity in the food supply, including in many single-ingredient foods. Previous studies evaluating the performance of photo-based dietary assessment methods for nutrient estimation have found correlations of similar strength; however, those methods shift some of the effort to a dietitian. In a post-study questionnaire, participants were asked about their experiences. The post-study questionnaire was not ideal for comparing the two systems because we asked participants to do different things for Bitesnap compared to ASA24. For example, participants did not have to record ‘after’ photos or multiple helpings, or use a sizing marker, in ASA24. Nevertheless, more participants preferred logging food photos than using ASA24. Participants reported trouble with multicultural foods. For ASA24 and Bitesnap, 23% and 30% of all participants, respectively, indicated difficulty finding foods in the respective system. ASA24 and Bitesnap both use the USDA’s FoodData Central database, which is limited in recipes or ingredients less commonly used in American diets. Of the 23 Asian participants who completed the post-study questionnaire, 15 reported difficulties logging Chinese, Vietnamese, Thai, Japanese, and Korean foods. As such, multicultural foods need to be added to food composition databases to improve inclusivity and diversity in dietary assessment methods. Some recent studies have evaluated image-based dietary assessment systems in comparison to weighed or unweighed food records.
However, these studies have not released labeled image databases, so their data cannot be re-used to compare image-processing techniques. In the current study, we elected to use ASA24 food records rather than weighed food records, as the increased burden of the latter would have likely reduced the number and variety of foods reported. There are limitations to our study. Restaurant foods were intentionally under-represented, as it was expected that participants would not know the details of the ingredients used to prepare these meals. Also, study participants were predominantly young women, so the database does not represent a cross-section of the population. However, it was not intended to represent a cross-section; it was intended to be created by participants with knowledge of food and portion sizes, which is not true of the general population. Finally, our evaluation of existing algorithms was not necessarily comprehensive; these analyses are intended as examples of the use of the SNAPMe benchmark. Importantly, our ground truth data are likely imperfect because we did not provide participants with weighed, packed-out food in coolers. It would have been impractical to prepare thousands of different meals, and those meals would not have been reflective of what or how people eat “in the wild.” In order to represent thousands of meals from diverse participants consuming their usual foods, we implemented safeguards such as the targeted recruitment of participants from nutrition programs, the pre-screening of prospective participants for their ability to identify ingredients and portion sizes, the simultaneous recording of food records, near real-time checks of food photos and ASA24 records with participants prompted to re-do study days if they recorded foods that did not have an accompanying food photo, and extensive quality control during which each line of each ASA24 food record was linked to a food photo.
Multiplicity occurs when many hypotheses are tested simultaneously without consideration of one another, and often results in false-positive findings or spurious associations.
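The Benjamini–Hochberg correction applied to the post hoc p-values in the methods controls the false discovery rate under exactly this kind of multiplicity. A minimal pure-Python sketch of the adjustment, with illustrative p-values:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control).

    For sorted p-values p_(1) <= ... <= p_(m), the adjusted value is
    min over j >= rank of p_(j) * m / j, enforced here with a running minimum.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative raw p-values, e.g. from several pairwise comparisons.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

This mirrors what R's `p.adjust(p, method = "BH")` computes, so the sketch can be checked against the stats package the study used.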