Executive Summary

Dietary guidelines and nutrition labels guide public food choices, yet individual needs vary. This study aims to develop a personalized food recommendation algorithm based on nutrient profiles. I compared spectrum clustering and k-means clustering, finding k-means to be superior. This document outlines the methods, analyses, and practical applications of these findings.

Introduction

Nutrition’s Role in Health

Diet significantly impacts health outcomes, from cardiovascular disease (Shanta Retelny, Neuendorf, and Roth 2008) to cognitive function (Dani, Burrill, and Demmig-Adams 2005). However, many people struggle with unhealthy eating habits, often due to an overabundance of food choices. Improving dietary decisions through better food groupings could help mitigate this issue.

Issues with Current Food Groupings

Existing nutrition datasets categorize foods, but the methods are unclear, leading to inconsistencies in nutrient profiles within groups. This study reinterprets food groupings, aiming to create clusters that better align with individual dietary needs. Specifically, the results can be used to identify foods with the optimal combination of nutrients tailored to a person’s dietary requirements.

Study Goals

This analysis of public nutrient datasets aims to develop an algorithm that recommends foods based on nutrient profiles, identifying clusters with high intra-group and low inter-group similarity.

Methods

Exploratory Data Analysis

I used the Canadian Nutrient File (2015) dataset. After merging its component datasets by their unique identifiers, the data contained nutrient values for 5690 foods across 151 nutrient variables (see appendix). Exploratory analyses revealed some redundant and missing data. I handled the missing values with mean imputation so that all subsequent analyses ran on a complete dataset.
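The imputation step can be illustrated with a minimal sketch. The original analysis was run in R, so this Python version is only a stand-in; the function name, the toy data, and the fallback for fully empty columns are assumptions, not the author's code.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace None entries in each column with that column's observed mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a column with no observed values is filled with 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]

# Hypothetical toy data: three foods, two nutrients, with gaps.
foods = [[10.0, None], [None, 4.0], [20.0, 8.0]]
imputed = impute_column_means(foods)
```

Mean imputation keeps each column's mean unchanged but shrinks its spread. This is consistent with the appendix summary, where several sparsely measured nutrients have a median equal to the mean and an IQR of 0.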

Main Analyses

I compared k-means and spectrum clustering to group foods by nutrient profiles, aiming for high intra-group similarity and low inter-group similarity. The silhouette method was used to identify optimal cluster numbers, and data were standardized before analysis.
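As a rough illustration of the two preprocessing and evaluation ideas named above, the sketch below standardizes columns to z-scores and computes the average silhouette width for a given labeling. It is a pure-Python stand-in for the R workflow; the function names and toy data are hypothetical, and the sample standard deviation (n - 1) is used on the assumption that it matches R's `scale()`.

```python
from math import dist
from statistics import mean, stdev

def standardize(rows):
    """Z-score each column using the sample standard deviation (n - 1)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Assumption: constant columns are left unscaled (divide by 1).
    sds = [stdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def mean_silhouette(points, labels):
    """Average silhouette width; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(own)  # mean distance to the point's own cluster
        # b: smallest mean distance to any other cluster
        b = min(
            mean([dist(p, q) for q, lab in zip(points, labels) if lab == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

# Hypothetical toy data: two tight groups separated on the first feature.
raw = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = mean_silhouette(raw, [0, 0, 1, 1])   # labels follow the real groups
bad = mean_silhouette(raw, [0, 1, 0, 1])    # labels cut across the groups
z = standardize(raw)
```

Choosing the number of clusters with the silhouette method amounts to running the clustering for several candidate values of k and keeping the k with the highest average width.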

Results

K-means vs. Spectrum Clustering

K-means clustering

Using the silhouette method, 2 clusters were identified (see appendix). K-means clustering resulted in a high proportion of total sum of squares explained by between-cluster sum of squares, indicating good cluster separation.
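The separation measure mentioned here, the share of the total sum of squares that lies between clusters, can be sketched as follows. R's `kmeans()` reports the two quantities as `betweenss` and `totss`; this Python version is an illustrative re-derivation on hypothetical data, not the author's code, and it assumes the points are not all identical (otherwise the total is zero).

```python
from statistics import mean

def between_ss_ratio(points, labels):
    """Return between-cluster SS divided by total SS."""
    n_dims = len(points[0])
    grand = [mean(p[d] for p in points) for d in range(n_dims)]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in range(n_dims))
    within_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centroid = [mean(m[d] for m in members) for d in range(n_dims)]
        within_ss += sum(
            (m[d] - centroid[d]) ** 2 for m in members for d in range(n_dims)
        )
    # Total SS decomposes into within-cluster SS plus between-cluster SS.
    return (total_ss - within_ss) / total_ss

# Hypothetical toy data: total SS = 104, within SS = 4.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
ratio = between_ss_ratio(points, [0, 0, 1, 1])
```

A ratio near 1 means most of the variation is explained by cluster membership; a ratio near 0 means the clusters barely differ from the overall mean.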

Spectrum Clustering

A scree plot guided the selection of 10 principal components (PCs) for spectrum clustering. K-means clustering on these PCs produced 8 clusters (see appendix). The plot below shows the 8 clusters formed using spectrum clustering across the first two PCs, with labels for the centroid of each cluster.

Comparing Methods

K-means outperformed spectrum clustering, as evidenced by higher silhouette width and Dunn index scores, suggesting better-defined clusters.
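For readers unfamiliar with the Dunn index, a minimal sketch of one common definition follows: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. This is a Python illustration on hypothetical data; the R implementation used in the analysis may differ in detail, and the sketch assumes at least one cluster has two distinct points.

```python
from itertools import combinations
from math import dist

def dunn_index(points, labels):
    """Minimum between-cluster distance over maximum within-cluster diameter."""
    min_between = float("inf")
    max_within = 0.0
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = dist(p, q)
        if lp == lq:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_within

# Hypothetical toy data: clusters are 10 apart and each has diameter 1.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
dunn = dunn_index(points, [0, 0, 1, 1])
```

Higher values indicate compact, well-separated clusters, which is why a higher Dunn index alongside a higher silhouette width favors one method over the other.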

Exploring “intelligent” clusters

To explore the ‘intelligent’ clusters created through k-means clustering, I plotted word clouds of the most commonly used words in the food names within each cluster. For instance, the first word cloud shows that the most frequently used word in cluster 1 was ‘raw,’ followed by ‘frozen,’ ‘boiled,’ and ‘canned.’ This suggests that the cluster is characterized by how a food is stored or cooked. In terms of specific foods, ‘cereal’ is frequently mentioned, along with other sugary foods.

The word cloud for the second cluster shows that ‘raw’ is also frequently used in this cluster. Notably, a larger proportion of the foods in this cluster are related to meat and high-protein foods, such as ‘meat,’ ‘lean,’ ‘fat,’ and ‘fish.’
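The counts behind a word cloud are simply per-cluster word frequencies over the food names. The sketch below shows that step in Python with a small hand-picked stopword list; the food names, labels, and stopwords are hypothetical, and the original R pipeline may have applied additional text cleaning such as stemming.

```python
import re
from collections import Counter

STOPWORDS = {"and", "or", "with", "of"}  # assumption: minimal illustrative list

def top_words_by_cluster(names, labels, n=3):
    """Return the n most frequent words in food names, per cluster label."""
    counts = {}
    for name, label in zip(names, labels):
        words = [w for w in re.findall(r"[a-z]+", name.lower())
                 if w not in STOPWORDS]
        counts.setdefault(label, Counter()).update(words)
    return {label: c.most_common(n) for label, c in counts.items()}

# Hypothetical food names and cluster labels.
names = [
    "Carrots, raw",
    "Peas, frozen, boiled",
    "Spinach, raw",
    "Veal, ground, broiled",
    "Veal, rib, lean and fat, roasted",
]
labels = [1, 1, 1, 2, 2]
top = top_words_by_cluster(names, labels)
```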

I also compared calorie content across the clusters and found that cluster 2 had a noticeably higher calorie content than cluster 1. Therefore, someone focused on reducing overall calorie intake may be advised to limit foods found in cluster 2.

Case Study: Designing a Muscle-Gain Diet

After determining the best clustering approach based on food nutrient profiles, I provide a case study to illustrate how this tool can be used to make recommendations for a person with specific dietary needs.

Arnold is an aspiring professional bodybuilder who wants to gain muscle as quickly as possible. Understanding the importance of nutrition in muscle building, he plans to adjust his diet accordingly, following guidance he found online. Arnold currently weighs 180 pounds and aims to consume 1.5 grams of protein per pound of bodyweight, or 270 grams of protein daily. The guidance also recommends 2 to 3 grams of carbohydrates per pound, or 360 to 540 grams of total carbohydrates per day, and at least 3600 kilocalories daily to support muscle gain. Finally, because of their muscle-building properties, Arnold wants several other nutrients at levels at least 2 standard deviations above the mean: calcium, biotin, iron, vitamin C, selenium, Omega-3, vitamin D, vitamin B12, copper, magnesium, riboflavin, and zinc.

To simulate Arnold’s dietary needs, I built a target nutrient profile from the dataset used in my main analyses. As a point of comparison, I used the DRI Calculator for Healthcare Professionals from the National Agricultural Library. This tool calculates daily nutrient recommendations based on the Dietary Reference Intakes (DRIs) established by the Health and Medicine Division of the National Academies of Sciences, Engineering, and Medicine. I entered gender (male), age (20), height (5 feet 10 inches), weight (180 pounds), and activity level (sedentary). The resulting recommendations therefore describe someone who matches Arnold in every characteristic except activity level, that is, someone who does not aspire to be a professional bodybuilder.

I then compared Arnold’s nutrient goals to these recommendations. For example, Arnold aims to consume 3600 kilocalories per day, compared with the recommended 2734 kilocalories for his less active counterpart, so he is consuming approximately 1.32 times the recommended kilocalories (3600 / 2734). Similarly, Arnold will consume 1.2 times the recommended total carbohydrates, 4.15 times the recommended protein, 270 times the recommended saturated fats, and 10.8 times the recommended total fat.

I created nutrient data for Arnold by multiplying the average value of each of these target nutrients across the dataset by the corresponding ratio of Arnold’s goal to the recommendation (e.g., 1.32 * average kilocalorie value across all foods). I did this for kilocalories, total carbohydrates, saturated fats, protein, and total fat. For the nutrients Arnold aims to consume at levels at least 2 standard deviations above the mean (i.e., calcium, biotin, iron, vitamin C, selenium, Omega-3, vitamin D, vitamin B12, copper, magnesium, riboflavin, and zinc), I calculated the value 2 standard deviations above the dataset mean and inserted it into Arnold’s nutrient data.
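The arithmetic behind Arnold’s targets can be checked with a short sketch. The weight, per-pound multipliers, and kilocalorie figures come from the case study; the dataset means and standard deviation are the rounded values from the appendix summary, used purely for illustration. The helper names are hypothetical, and whether the original analysis applied these steps to raw or standardized values is not stated, so this is only one reading.

```python
weight_lb = 180
protein_g = 1.5 * weight_lb                  # 270.0 g of protein per day
carbs_g = (2 * weight_lb, 3 * weight_lb)     # 360 to 540 g of carbohydrates per day
kcal_ratio = 3600 / 2734                     # goal / recommendation, about 1.32

def ratio_target(dataset_mean, ratio):
    """Scale a nutrient's dataset mean by Arnold's goal-to-recommendation ratio."""
    return ratio * dataset_mean

def two_sd_target(dataset_mean, dataset_sd):
    """Value 2 standard deviations above the dataset mean."""
    return dataset_mean + 2 * dataset_sd

# Rounded appendix values: energy mean 219 kcal; calcium mean 76.9, sd 219.
profile = {
    "ENERGY (KILOCALORIES)": ratio_target(219, kcal_ratio),
    "CALCIUM": two_sd_target(76.9, 219),
}
```

Under these assumptions the energy target is roughly 288 and the calcium target roughly 515, in the same units as the dataset columns.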

Next, I identified the cluster that most closely matched Arnold’s nutrient needs by finding the cluster whose center had the smallest Euclidean distance from Arnold’s goal nutrient profile. My analyses indicated that Arnold would best achieve his goals by eating foods in cluster 2. I then ranked the foods in the original dataset by their Euclidean distance from the center of cluster 2 and took the 10 closest foods as recommendations for meeting Arnold’s target nutrient profile. The table below suggests that Arnold should focus on meats and fish to support his goal of becoming a professional bodybuilder.

Food | Euclidean distance from center of cluster 2
Duck, young duckling, domestic, White Pekin, breast, meat and skin, boneless, roasted | 2.57
Fish, butterfish, baked or broiled | 2.57
Veal, rib, lean and fat, roasted | 2.59
Veal, shoulder, shank, lean and fat, roasted | 2.60
Chicken, broiler, meat, skin, giblets and neck, batter dipped, fried | 2.67
Chicken, roasting, meat, skin, giblets and neck, roasted | 2.68
Chicken, broiler, drumstick, meat and skin, batter dipped, fried | 2.69
Chicken, roasting, meat and skin, roasted | 2.70
Veal, sirloin, lean and fat, roasted | 2.71
Veal, ground, broiled | 2.73
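The two-stage matching described in this case study (nearest cluster center first, then nearest foods to that center) can be sketched as follows. This is a Python illustration with hypothetical centroids, food names, and vectors; it is not the author's R code and does not reproduce the distances in the table.

```python
from math import dist

def nearest_cluster(target, centroids):
    """Return the label of the centroid closest to the target profile."""
    return min(centroids, key=lambda label: dist(target, centroids[label]))

def closest_foods(centroid, foods, k=10):
    """Return the k foods nearest the centroid as (name, distance) pairs."""
    ranked = sorted(foods.items(), key=lambda item: dist(centroid, item[1]))
    return [(name, round(dist(centroid, vec), 2)) for name, vec in ranked[:k]]

# Hypothetical standardized nutrient vectors (two nutrients for brevity).
centroids = {1: [0.0, 0.0], 2: [2.0, 1.0]}
target = [2.5, 1.5]
foods = {"food A": [2.0, 1.5], "food B": [0.0, 0.0], "food C": [3.0, 1.0]}

best = nearest_cluster(target, centroids)
recommended = closest_foods(centroids[best], foods, k=2)
```

Ranking foods by distance to the cluster center, rather than to the target profile itself, returns the most typical members of the chosen cluster. Ranking against the target directly would be a different design choice and could return different foods.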

Conclusion

Contrary to my expectations, k-means clustering provided better results than spectrum clustering. The case study illustrates how these clusters can inform dietary choices, such as optimizing nutrient intake for muscle gain.

Limitations

The dataset’s average nutrient values and focus on Canadian products limit generalizability. Additionally, nutrient values are standardized, requiring portion size adjustments for practical use. Future work could expand these analyses to other countries and explore food-specific goals.

Appendix

Missing Values in Original Dataset

Data Frame Summary

data_mean_imp

Dimensions: 5690 x 151
Duplicates: 0

All 151 variables are numeric.

No | Variable | Mean (sd) | min < med < max | IQR (CV) | Freqs (% of Valid) | Missing
1 | PROTEIN | 11.1 (10.8) | 0 < 7.6 < 85.6 | 16.6 (1) | 2261 distinct values | 0 (0.0%)
2 | FAT (TOTAL LIPIDS) | 10 (16.7) | 0 < 3.8 < 100 | 11.6 (1.7) | 1913 distinct values | 0 (0.0%)
3 | CARBOHYDRATE, TOTAL (BY DIFFERENCE) | 22 (26.5) | 0 < 10.3 < 100 | 31.6 (1.2) | 2756 distinct values | 0 (0.0%)
4 | ASH, TOTAL | 1.9 (3.5) | 0 < 1.2 < 99.8 | 1.2 (1.8) | 647 distinct values | 0 (0.0%)
5 | ENERGY (KILOCALORIES) | 219 (174) | 0 < 174 < 902 | 240 (0.8) | 665 distinct values | 0 (0.0%)
6 | ALCOHOL | 0.1 (1.8) | 0 < 0 < 42.5 | 0 (14.3) | 38 distinct values | 0 (0.0%)
7 | MOISTURE | 55 (31) | 0 < 64.7 < 100 | 50.3 (0.6) | 3417 distinct values | 0 (0.0%)
8 | CAFFEINE | 3.9 (97.8) | 0 < 0 < 5714 | 0 (24.9) | 62 distinct values | 0 (0.0%)
9 | THEOBROMINE | 7 (74.8) | 0 < 0 < 2634 | 0 (10.7) | 131 distinct values | 0 (0.0%)
10 | ENERGY (KILOJOULES) | 915 (727) | 0 < 727 < 3774 | 1006 (0.8) | 1659 distinct values | 0 (0.0%)
11 | SUGARS, TOTAL | 7.7 (13.6) | 0 < 3 < 99.8 | 7.7 (1.8) | 1506 distinct values | 0 (0.0%)
12 | FIBRE, TOTAL DIETARY | 2.4 (4.7) | 0 < 1 < 79 | 2.7 (1.9) | 238 distinct values | 0 (0.0%)
13 | CALCIUM | 76.9 (219) | 0 < 25 < 7364 | 63 (2.8) | 469 distinct values | 0 (0.0%)
14 | IRON | 2.6 (5.6) | 0 < 1.1 < 124 | 2.1 (2.2) | 883 distinct values | 0 (0.0%)
15 | MAGNESIUM | 39.7 (63.6) | 0 < 22 < 781 | 26.7 (1.6) | 303 distinct values | 0 (0.0%)
16 | PHOSPHORUS | 168 (233) | 0 < 136 < 9918 | 172 (1.4) | 621 distinct values | 0 (0.0%)
17 | POTASSIUM | 308 (441) | 0 < 240 < 16500 | 208 (1.4) | 896 distinct values | 0 (0.0%)
18 | SODIUM | 333 (1214) | 0 < 83 < 38758 | 335 (3.6) | 1100 distinct values | 0 (0.0%)
19 | ZINC | 1.6 (3) | 0 < 0.9 < 91 | 1.7 (1.8) | 696 distinct values | 0 (0.0%)
20 | COPPER | 0.2 (0.6) | 0 < 0.1 < 15.1 | 0.2 (2.8) | 788 distinct values | 0 (0.0%)
21 | MANGANESE | 0.6 (3.5) | 0 < 0.2 < 133 | 0.6 (5.8) | 1227 distinct values | 0 (0.0%)
22 | SELENIUM | 14.6 (34.3) | 0 < 10 < 1917 | 16.7 (2.4) | 615 distinct values | 0 (0.0%)
23 | RETINOL | 88.8 (802) | 0 < 0 < 30000 | 27 (9) | 327 distinct values | 0 (0.0%)
24 | BETA CAROTENE | 292 (1610) | 0 < 2 < 42891 | 120 (5.5) | 613 distinct values | 0 (0.0%)
25 | ALPHA-TOCOPHEROL | 1.2 (3.5) | 0 < 0.6 < 149 | 1 (3) | 448 distinct values | 0 (0.0%)
26 | VITAMIN D (INTERNATIONAL UNITS) | 23.9 (226) | 0 < 0 < 12716 | 18 (9.5) | 215 distinct values | 0 (0.0%)
27 | VITAMIN D (D2 + D3) | 0.6 (5.9) | 0 < 0 < 318 | 0.4 (9.4) | 130 distinct values | 0 (0.0%)
28 | VITAMIN C | 8.2 (52.3) | 0 < 0.2 < 1900 | 4.6 (6.4) | 459 distinct values | 0 (0.0%)
29 | THIAMIN | 0.2 (0.5) | 0 < 0.1 < 23.4 | 0.2 (2.5) | 813 distinct values | 0 (0.0%)
30 | RIBOFLAVIN | 0.2 (0.4) | 0 < 0.1 < 17.5 | 0.2 (2) | 710 distinct values | 0 (0.0%)
31 | NIACIN (NICOTINIC ACID) PREFORMED | 3.1 (4.3) | 0 < 1.8 < 128 | 4.3 (1.4) | 2829 distinct values | 0 (0.0%)
32 | TOTAL NIACIN EQUIVALENT | 5.2 (5.5) | 0 < 3.9 < 132 | 6.9 (1.1) | 3909 distinct values | 0 (0.0%)
33 | PANTOTHENIC ACID | 0.6 (0.9) | 0 < 0.6 < 21.9 | 0.5 (1.3) | 1317 distinct values | 0 (0.0%)
34 | VITAMIN B-6 | 0.2 (1) | 0 < 0.1 < 68.8 | 0.2 (4.3) | 757 distinct values | 0 (0.0%)
35 | TOTAL FOLACIN | 37.7 (89.9) | 0 < 14 < 3786 | 32.7 (2.4) | 291 distinct values | 0 (0.0%)
36 | VITAMIN B-12 | 1.1 (6.5) | 0 < 0.1 < 380 | 1 (5.9) | 900 distinct values | 0 (0.0%)
37 | VITAMIN K | 20.8 (74.6) | 0 < 20.8 < 1714 | 19.4 (3.6) | 435 distinct values | 0 (0.0%)
38 | FOLIC ACID | 8.4 (48.4) | 0 < 0 < 2993 | 0 (5.7) | 161 distinct values | 0 (0.0%)
39 | TRYPTOPHAN | 0.1 (0.1) | 0 < 0.1 < 1.6 | 0.1 (0.8) | 459 distinct values | 0 (0.0%)
40 | THREONINE | 0.5 (0.4) | 0 < 0.5 < 3.7 | 0.5 (0.8) | 1301 distinct values | 0 (0.0%)
41 | ISOLEUCINE | 0.6 (0.4) | 0 < 0.6 < 5 | 0.5 (0.8) | 1370 distinct values | 0 (0.0%)
42 | LEUCINE | 1 (0.7) | 0 < 1 < 7.2 | 0.9 (0.8) | 1835 distinct values | 0 (0.0%)
43 | LYSINE | 0.9 (0.8) | 0 < 0.9 < 5.8 | 0.9 (0.9) | 1699 distinct values | 0 (0.0%)
44 | METHIONINE | 0.3 (0.2) | 0 < 0.3 < 3.2 | 0.3 (0.8) | 860 distinct values | 0 (0.0%)
45 | CYSTINE | 0.2 (0.1) | 0 < 0.2 < 2.1 | 0.1 (0.8) | 496 distinct values | 0 (0.0%)
46 | PHENYLALANINE | 0.5 (0.4) | 0 < 0.5 < 5.2 | 0.4 (0.7) | 1275 distinct values | 0 (0.0%)
47 | TYROSINE | 0.4 (0.3) | 0 < 0.4 < 3.3 | 0.4 (0.8) | 1132 distinct values | 0 (0.0%)
48 | VALINE | 0.6 (0.5) | 0 < 0.6 < 6.2 | 0.5 (0.8) | 1429 distinct values | 0 (0.0%)
49 | ARGININE | 0.8 (0.7) | 0 < 0.8 < 7.4 | 0.8 (0.9) | 1627 distinct values | 0 (0.0%)
50 | HISTIDINE | 0.4 (0.3) | 0 < 0.4 < 2.3 | 0.3 (0.8) | 1076 distinct values | 0 (0.0%)
51 | ALANINE | 0.7 (0.5) | 0 < 0.7 < 8 | 0.6 (0.8) | 1490 distinct values | 0 (0.0%)
52 | ASPARTIC ACID | 1.2 (0.9) | 0 < 1.2 < 10.2 | 1 (0.8) | 1937 distinct values | 0 (0.0%)
53 | GLUTAMIC ACID | 2.4 (10.1) | 0 < 2.4 < 757 | 1.7 (4.3) | 2433 distinct values | 0 (0.0%)
54 | GLYCINE | 0.6 (0.6) | 0 < 0.6 < 19 | 0.6 (0.9) | 1439 distinct values | 0 (0.0%)
55 | PROLINE | 0.7 (0.5) | 0 < 0.7 < 12.3 | 0.5 (0.8) | 1420 distinct values | 0 (0.0%)
56 | SERINE | 0.5 (0.4) | 0 < 0.5 < 6.1 | 0.4 (0.7) | 1289 distinct values | 0 (0.0%)
57 | CHOLESTEROL | 41.5 (136) | 0 < 2 < 3100 | 59 (3.3) | 292 distinct values | 0 (0.0%)
58 | FATTY ACIDS, TRANS, TOTAL | 0.3 (1.1) | 0 < 0.3 < 37.6 | 0.2 (3.6) | 499 distinct values | 0 (0.0%)
59 | FATTY ACIDS, SATURATED, TOTAL | 3.1 (5.7) | 0 < 1.3 < 95.6 | 3.3 (1.8) | 2813 distinct values | 0 (0.0%)
60 | FATTY ACIDS, SATURATED, 8:0, OCTANOIC | 0 (0.2) | 0 < 0 < 7.5 | 0 (6) | 267 distinct values | 0 (0.0%)
61 | FATTY ACIDS, SATURATED, 10:0, DECANOIC | 0 (0.2) | 0 < 0 < 6 | 0 (4.4) | 351 distinct values | 0 (0.0%)
62 | FATTY ACIDS, SATURATED, 12:0, DODECANOIC | 0.2 (1.5) | 0 < 0 < 47 | 0.2 (7.7) | 449 distinct values | 0 (0.0%)
63 | FATTY ACIDS, SATURATED, 14:0, TETRADECANOIC | 0.2 (0.8) | 0 < 0.1 < 22.8 | 0.2 (3.3) | 788 distinct values | 0 (0.0%)
64 | FATTY ACIDS, SATURATED, 16:0, HEXADECANOIC | 1.7 (2.6) | 0 < 0.9 < 43.5 | 1.8 (1.6) | 2323 distinct values | 0 (0.0%)
65 | FATTY ACIDS, SATURATED, 18:0, OCTADECANOIC | 0.8 (1.6) | 0 < 0.4 < 33.2 | 0.8 (1.9) | 1676 distinct values | 0 (0.0%)
66 | FATTY ACIDS, MONOUNSATURATED, 18:1undifferentiated, OCTADECENOIC | 3.5 (6.8) | 0 < 1.4 < 82.6 | 3.4 (1.9) | 2709 distinct values | 0 (0.0%)
67 | FATTY ACIDS, POLYUNSATURATED, 18:2undifferentiated, LINOLEIC, OCTADECADIENOIC | 1.8 (4.4) | 0 < 0.5 < 74.6 | 1.7 (2.4) | 2080 distinct values | 0 (0.0%)
68 | FATTY ACIDS, POLYUNSATURATED, 18:3undifferentiated, LINOLENIC, OCTADECATRIENOIC | 0.2 (1.2) | 0 < 0.1 < 53.4 | 0.2 (5.6) | 690 distinct values | 0 (0.0%)
69 | FATTY ACIDS, POLYUNSATURATED, 20:4, EICOSATETRAENOIC | 0 (0.1) | 0 < 0 < 1.8 | 0 (2.3) | 262 distinct values | 0 (0.0%)
70 | FATTY ACIDS, POLYUNSATURATED, 22:6 n-3, DOCOSAHEXAENOIC (DHA) | 0 (0.5) | 0 < 0 < 18.2 | 0 (9.5) | 297 distinct values | 0 (0.0%)
71 | FATTY ACIDS, MONOUNSATURATED, 16:1undifferentiated, HEXADECENOIC | 0.2 (0.9) | 0 < 0.1 < 18.9 | 0.2 (3.7) | 771 distinct values | 0 (0.0%)
72 | FATTY ACIDS, POLYUNSATURATED, 18:4, OCTADECATETRAENOIC | 0 (0.1) | 0 < 0 < 3 | 0 (8.4) | 127 distinct values | 0 (0.0%)
73 | FATTY ACIDS, POLYUNSATURATED, 20:5 n-3, EICOSAPENTAENOIC (EPA) | 0 (0.4) | 0 < 0 < 13.2 | 0 (8.5) | 260 distinct values | 0 (0.0%)
74 | FATTY ACIDS, MONOUNSATURATED, 22:1undifferentiated, DOCOSENOIC | 0 (0.7) | 0 < 0 < 41.2 | 0 (14.8) | 200 distinct values | 0 (0.0%)
75 | FATTY ACIDS, POLYUNSATURATED, 22:5 n-3, DOCOSAPENTAENOIC (DPA) | 0 (0.2) | 0 < 0 < 5.6 | 0 (10.9) | 163 distinct values | 0 (0.0%)
76 | FATTY ACIDS, MONOUNSATURATED, TOTAL | 3.9 (7.6) | 0 < 1.4 < 83.7 | 4.2 (1.9) | 2881 distinct values | 0 (0.0%)
77 | FATTY ACIDS, POLYUNSATURATED, TOTAL | 2.2 (5.1) | 0 < 0.7 < 74.6 | 2 (2.3) | 2382 distinct values | 0 (0.0%)
78 | NATURALLY OCCURRING FOLATE | 29.2 (71.9) | 0 < 11 < 2340 | 24.2 (2.5) | 262 distinct values | 0 (0.0%)
79 | RETINOL ACTIVITY EQUIVALENTS | 115 (817) | 0 < 5 < 30000 | 43 (7.1) | 464 distinct values | 0 (0.0%)
80 | DIETARY FOLATE EQUIVALENTS | 44.4 (114) | 0 < 15 < 5881 | 39.4 (2.6) | 333 distinct values | 0 (0.0%)
81 | FATTY ACIDS, POLYUNSATURATED, 18:2 c,c n-6, LINOLEIC, OCTADECADIENOIC | 2.3 (3.9) | 0 < 2.3 < 74.6 | 1.5 (1.7) | 1286 distinct values | 0 (0.0%)
82 | FATTY ACIDS, POLYUNSATURATED, 20:3, EICOSATRIENOIC | 0 (0) | 0 < 0 < 1.4 | 0 (5.5) | 90 distinct values | 0 (0.0%)
83 | FATTY ACIDS, POLYUNSATURATED, 18:3 c,c,c n-3 LINOLENIC, OCTADECATRIENOIC | 0.2 (1.1) | 0 < 0.1 < 53.4 | 0.2 (5.6) | 620 distinct values | 0 (0.0%)
84 | FATTY ACIDS, POLYUNSATURATED, 18:3 c,c,c n-6, g-LINOLENIC, OCTADECATRIENOIC | 0 (0) | 0 < 0 < 1 | 0 (17.6) | 50 distinct values | 0 (0.0%)
85 | BETA CRYPTOXANTHIN | 15.2 (152) | 0 < 1 < 6252 | 15.2 (10) | 129 distinct values | 0 (0.0%)
86 | LYCOPENE | 220 (1390) | 0 < 0 < 46260 | 220 (6.3) | 191 distinct values | 0 (0.0%)
87 | LUTEIN AND ZEAXANTHIN | 260 (1063) | 0 < 104 < 19697 | 260 (4.1) | 420 distinct values | 0 (0.0%)
88 | FATTY ACIDS, POLYUNSATURATED, 20:3 n-6, EICOSATRIENOIC | 0 (0) | 0 < 0 < 1.4 | 0 (13.9) | 72 distinct values | 0 (0.0%)
89 | FATTY ACIDS, POLYUNSATURATED, 20:4 n-6, ARACHIDONIC | 0 (0.1) | 0 < 0 < 1.8 | 0 (2) | 228 distinct values | 0 (0.0%)
90 | FATTY ACIDS, POLYUNSATURATED, 20:3 n-3 EICOSATRIENOIC | 0 (0) | 0 < 0 < 1 | 0 (15.9) | 53 distinct values | 0 (0.0%)
91 | VITAMIN B12, ADDED | 1 (5) | 0 < 1 < 380 | 0 (5.2) | 29 distinct values | 0 (0.0%)
92 | ALPHA-TOCOPHEROL, ADDED | 0.1 (0.3) | 0 < 0.1 < 16.9 | 0 (3.4) | 12 distinct values | 0 (0.0%)
93 | VITAMIN D2, ERGOCALCIFEROL | 0.3 (0.5) | 0 < 0.3 < 28.1 | 0 (1.5) | 23 distinct values | 0 (0.0%)
94 | FATTY ACIDS, SATURATED, 4:0, BUTANOIC | 0 (0.1) | 0 < 0 < 3.2 | 0 (4.1) | 275 distinct values | 0 (0.0%)
95 | FATTY ACIDS, SATURATED, 6:0, HEXANOIC | 0 (0.1) | 0 < 0 < 2 | 0 (4) | 225 distinct values | 0 (0.0%)
96 | ALPHA CAROTENE | 40.8 (297) | 0 < 1 < 14251 | 40.8 (7.3) | 165 distinct values | 0 (0.0%)
97 | FATTY ACIDS, MONOUNSATURATED, 22:1c, DOCOSENOIC | 0 (0) | 0 < 0 < 1.1 | 0 (4) | 101 distinct values | 0 (0.0%)
98 | FATTY ACIDS, POLYUNSATURATED, 18:3i, LINOLENIC, OCTADECATRIENOIC | 0 (0) | 0 < 0 < 0.3 | 0 (2.2) | 55 distinct values | 0 (0.0%)
99 | FATTY ACIDS, MONOUNSATURATED, 22:1t, DOCOSENOIC | 0 (0) | 0 < 0 < 0.1 | 0 (12.8) | 17 distinct values | 0 (0.0%)
100 | SUCROSE | 2 (5) | 0 < 2 < 99.8 | 2 (2.5) | 488 distinct values | 0 (0.0%)
101 | GLUCOSE | 0.8 (1.7) | 0 < 0.8 < 35.8 | 0.8 (2.2) | 400 distinct values | 0 (0.0%)
102 | FRUCTOSE | 0.7 (1.7) | 0 < 0.7 < 55.6 | 0.7 (2.4) | 388 distinct values | 0 (0.0%)
103 | LACTOSE | 0.3 (0.8) | 0 < 0.3 < 13.2 | 0.3 (2.7) | 226 distinct values | 0 (0.0%)
104 | MALTOSE | 0.2 (0.5) | 0 < 0.2 < 16.4 | 0.2 (2.6) | 218 distinct values | 0 (0.0%)
105 | GALACTOSE | 0 (0.3) | 0 < 0 < 19.9 | 0 (9.5) | 54 distinct values | 0 (0.0%)
106 | FATTY ACIDS, SATURATED, 20:0, EICOSANOIC | 0 (0.1) | 0 < 0 < 4.6 | 0 (2.7) | 184 distinct values | 0 (0.0%)
107 | FATTY ACIDS, SATURATED, 22:0, DOCOSANOIC | 0 (0.1) | 0 < 0 < 3.7 | 0 (3.4) | 134 distinct values | 0 (0.0%)
108 | FATTY ACIDS, MONOUNSATURATED, 14:1, TETRADECENOIC | 0 (0.1) | 0 < 0 < 1.8 | 0 (2.4) | 157 distinct values | 0 (0.0%)
109 | FATTY ACIDS, MONOUNSATURATED, 20:1, EICOSENOIC | 0.1 (0.5) | 0 < 0 < 15 | 0.1 (5.3) | 366 distinct values | 0 (0.0%)
110 | FATTY ACIDS, SATURATED, 15:0, PENTADECANOIC | 0 (0) | 0 < 0 < 0.9 | 0 (1.7) | 122 distinct values | 0 (0.0%)
111 | FATTY ACIDS, SATURATED, 17:0, HEPTADECANOIC | 0 (0) | 0 < 0 < 0.8 | 0 (1.2) | 190 distinct values | 0 (0.0%)
112 | FATTY ACIDS, SATURATED, 24:0, TETRACOSANOIC | 0 (0.1) | 0 < 0 < 4.7 | 0 (5.2) | 92 distinct values | 0 (0.0%)
113 | STARCH | 4 (6.7) | 0 < 4 < 73.3 | 4 (1.7) | 361 distinct values | 0 (0.0%)
114 | BETA-TOCOPHEROL | 0.1 (0.2) | 0 < 0.1 < 10.5 | 0 (1.9) | 66 distinct values | 0 (0.0%)
115 | GAMMA-TOCOPHEROL | 2.3 (2.1) | 0 < 2.3 < 65.2 | 0 (0.9) | 275 distinct values | 0 (0.0%)
116 | DELTA-TOCOPHEROL | 0.4 (0.5) | 0 < 0.4 < 15.4 | 0 (1.2) | 149 distinct values | 0 (0.0%)
117 | FATTY ACIDS, MONOUNSATURATED, 16:1t, HEXADECENOIC | 0 (0.1) | 0 < 0 < 6.1 | 0 (10.8) | 74 distinct values | 0 (0.0%)
118 | FATTY ACIDS, MONOUNSATURATED, 18:1t, OCTADECENOIC | 0.1 (0.4) | 0 < 0.1 < 20.2 | 0 (2.9) | 296 distinct values | 0 (0.0%)
119 | FATTY ACIDS, POLYUNSATURATED, 18:2i, LINOLEIC, OCTADECADIENOIC | 0 (0) | 0 < 0 < 2.3 | 0 (1.8) | 141 distinct values | 0 (0.0%)
120 | FATTY ACIDS, MONOUNSATURATED, 24:1c, TETRACOSENOIC | 0 (0) | 0 < 0 < 0.6 | 0 (4.1) | 46 distinct values | 0 (0.0%)
121 | FATTY ACIDS, MONOUNSATURATED, 16:1c, HEXADECENOIC | 0.1 (0.2) | 0 < 0.1 < 6.9 | 0 (1.3) | 397 distinct values | 0 (0.0%)
122 | FATTY ACIDS, POLYUNSATURATED, 20:2 c,c EICOSADIENOIC | 0 (0) | 0 < 0 < 0.7 | 0 (1.7) | 129 distinct values | 0 (0.0%)
123 | FATTY ACIDS, MONOUNSATURATED, 18:1c, OCTADECENOIC | 4.7 (37.8) | 0 < 4.7 < 2845 | 0 (8.1) | 1067 distinct values | 0 (0.0%)
124 | FATTY ACIDS, MONOUNSATURATED, 17:1, HEPTADECENOIC | 0 (0) | 0 < 0 < 1.1 | 0 (1.5) | 136 distinct values | 0 (0.0%)
125 | FATTY ACIDS, TOTAL TRANS-MONOENOIC | 0.1 (0.4) | 0 < 0.1 < 20.2 | 0 (3.1) | 286 distinct values | 0 (0.0%)
126 | FATTY ACIDS, MONOUNSATURATED, 15:1, PENTADECENOIC | 0 (0.1) | 0 < 0 < 6 | 0 (15) | 28 distinct values | 0 (0.0%)
127 | FATTY ACIDS, POLYUNSATURATED, CONJUGATED, 18:2 cla, LINOLEIC, OCTADECADIENOIC | 0 (0) | 0 < 0 < 1.1 | 0 (2.2) | 91 distinct values | 0 (0.0%)
128 | FATTY ACIDS, POLYUNSATURATED, 22:4 n-6, DOCOSATETRAENOIC | 0 (0) | 0 < 0 < 0.3 | 0 (1.6) | 67 distinct values | 0 (0.0%)
129 | FATTY ACIDS, TOTAL TRANS-POLYENOIC | 0 (0.1) | 0 < 0 < 2.5 | 0 (1.8) | 155 distinct values | 0 (0.0%)
130 | CHOLINE, TOTAL | 39.1 (50.2) | 0 < 39.1 < 2403 | 20.4 (1.3) | 911 distinct values | 0 (0.0%)
131 | BETAINE | 10.6 (13.8) | 0 < 10.6 < 630 | 0 (1.3) | 258 distinct values | 0 (0.0%)
132 | FATTY ACIDS, POLYUNSATURATED, TOTAL OMEGA N-3 | 0.5 (1.4) | 0 < 0.5 < 53.4 | 0.3 (2.9) | 549 distinct values | 0 (0.0%)
133 | FATTY ACIDS, POLYUNSATURATED, TOTAL OMEGA N-6 | 3.1 (13.8) | 0 < 3.1 < 953 | 1.7 (4.5) | 1056 distinct values | 0 (0.0%)
134 | ASPARTAME | 51.1 (49.6) | 0 < 51.1 < 3722 | 0 (1) | 0.00: 82 (1.4%); 37.00: 1 (0.0%); 42.00: 1 (0.0%); 51.15 (rounded): 5603 (98.5%); 52.00: 1 (0.0%); 597.00: 1 (0.0%); 3722.00: 1 (0.0%) | 0 (0.0%)
135 | TOTAL PLANT STEROL | 26.4 (28.2) | 0 < 26.4 < 1190 | 0 (1.1) | 118 distinct values | 0 (0.0%)
136 | MANNITOL | 0 (0) | 0 < 0 < 0.2 | 0 (8.6) | 4 distinct values | 0 (0.0%)
137 | SORBITOL | 0 (0) | 0 < 0 < 2.3 | 0 (6.9) | 0.00: 1375 (24.2%); 0.01 (rounded): 4304 (75.6%); 0.10: 1 (0.0%); 0.20: 1 (0.0%); 0.30: 1 (0.0%); 0.60: 2 (0.0%); 0.80: 1 (0.0%); 1.00: 3 (0.1%); 2.10: 1 (0.0%); 2.30: 1 (0.0%) | 0 (0.0%)
138 | STIGMASTEROL | 1.3 (1.8) | 0 < 1.3 < 59 | 0 (1.4) | 27 distinct values | 0 (0.0%)
139 | TOTAL MONOSACCARIDES | 0.8 (1.5) | 0 < 0.8 < 30.6 | 0.7 (1.9) | 268 distinct values | 0 (0.0%)
140 | TOTAL DISACCHARIDES | 1.5 (2.8) | 0 < 1.5 < 47.2 | 1.3 (1.9) | 296 distinct values | 0 (0.0%)
141 | BETA-SITOSTEROL | 14 (14) | 0 < 14 < 621 | 0 (1) | 55 distinct values | 0 (0.0%)
142 | HYDROXYPROLINE | 0.1 (0) | 0 < 0.1 < 0.7 | 0 (0.4) | 198 distinct values | 0 (0.0%)
143 | FATTY ACIDS, SATURATED, 13:0 TRIDECANOIC | 0 (0) | 0 < 0 < 0.1 | 0 (3) | 11 distinct values | 0 (0.0%)
144 | FATTY ACIDS, POLYUNSATURATED, 21:5 | 0 (0) | 0 < 0 < 0.3 | 0 (4.7) | 13 distinct values | 0 (0.0%)
145 | FATTY ACIDS, MONOUNSATURATED, 24:1undifferentiated, TETRACOSENOIC | 0 (0) | 0 < 0 < 3 | 0 (8.9) | 33 distinct values | 0 (0.0%)
146 | FATTY ACIDS, POLYUNSATURATED, 22:3, | 0 (0) | 0 < 0 < 0.1 | 0 (5.5) | 11 distinct values | 0 (0.0%)
147 | FATTY ACIDS, POLYUNSATURATED, 22:2, DOCOSADIENOIC | 0 (0) | 0 < 0 < 0 | 0 (6.5) | 5 distinct values | 0 (0.0%)
148 | FATTY ACIDS, POLYUNSATURATED, 18:2t,t , OCTADECADIENENOIC | 0 (0) | 0 < 0 < 0.5 | 0 (2.5) | 60 distinct values | 0 (0.0%)
149 | CAMPESTEROL | 3.8 (3.6) | 0 < 3.8 < 189 | 0 (0.9) | 28 distinct values | 0 (0.0%)
150 | BIOTIN | 6.1 (0.9) | 0 < 6.1 < 31.6 | 0 (0.1) | 72 distinct values | 0 (0.0%)
151 | OXALIC ACID | 0.3 (0) | 0 < 0.3 < 1.7 | 0 (0.1) | 28 distinct values | 0 (0.0%)

Optimal Number of Clusters

Optimal Number of Clusters for K-means

Scree Plot from PCA in Spectrum Clustering

Optimal Number of Clusters for Spectrum Clustering on 10 PCs

R Packages

List of R Packages Used

  • Matrix (version 1.5.1; Bates D et al., 2022)
  • SnowballC (version 0.7.1; Bouchet-Valat M, 2023)
  • clValid (version 0.7; Brock G et al., 2008)
  • summarytools (version 1.0.1; Comtois D, 2022)
  • fields (version 15.2; Douglas Nychka et al., 2021)
  • data.table (version 1.14.6; Dowle M, Srinivasan A, 2022)
  • tm (version 0.7.11; Feinerer I, Hornik K, 2023)
  • wordcloud (version 2.6; Fellows I, 2018)
  • car (version 3.1.1; Fox J, Weisberg S, 2019)
  • carData (version 3.0.5; Fox J et al., 2022)
  • glmnet (version 4.1.8; Friedman J et al., 2010)
  • spam (version 2.10.0; Furrer R et al., 2023)
  • viridisLite (version 0.4.2; Garnier et al., 2023)
  • fpc (version 2.2.12; Hennig C, 2024)
  • stargazer (version 5.2.3; Hlavac M, 2022)
  • NLP (version 0.2.1; Hornik K, 2020)
  • factoextra (version 1.0.7; Kassambara A, Mundt F, 2020)
  • sjPlot (version 2.8.12; Lüdecke D, 2022)
  • cluster (version 2.1.4; Maechler M et al., 2022)
  • varhandle (version 2.0.6; Mahmoudian M, 2023)
  • report (version 0.5.9; Makowski D et al., 2023)
  • imputeTS (version 3.3; Moritz S, Bartz-Beielstein T, 2017)
  • here (version 1.0.1; Müller K, 2020)
  • tibble (version 3.2.1; Müller K, Wickham H, 2023)
  • RColorBrewer (version 1.1.3; Neuwirth E, 2022)
  • R (version 4.2.2; R Core Team, 2022)
  • pacman (version 0.5.1; Rinker TW, Kurkiewicz D, 2018)
  • pROC (version 1.18.4; Robin X et al., 2011)
  • GGally (version 2.2.1; Schloerke B et al., 2024)
  • gtsummary (version 2.0.1; Sjoberg D et al., 2021)
  • corrplot (version 0.92; Wei T, Simko V, 2021)
  • ggplot2 (version 3.4.4; Wickham H, 2016)
  • forcats (version 0.5.2; Wickham H, 2022)
  • stringr (version 1.5.1; Wickham H, 2023)
  • tidyverse (version 1.3.2; Wickham H et al., 2019)
  • dplyr (version 1.1.4; Wickham H et al., 2023)
  • purrr (version 1.0.2; Wickham H, Henry L, 2023)
  • readr (version 2.1.5; Wickham H et al., 2024)
  • tidyr (version 1.3.1; Wickham H et al., 2024)

Citations for R Packages Used

References

Dani, Jennifer, Courtney Burrill, and Barbara Demmig-Adams. 2005. “The remarkable role of nutrition in learning and behaviour.” Nutrition and Food Science 35 (4): 258–63. https://doi.org/10.1108/00346650510605658.
Shanta Retelny, Victoria, Annie Neuendorf, and Julie L. Roth. 2008. “Nutrition protocols for the prevention of cardiovascular disease.” Nutrition in Clinical Practice 23 (5): 468–76. https://doi.org/10.1177/0884533608323425.