Building a Data Pipeline for USDA FoodData Central: From Raw Files to a Structured Dataset
An in-depth guide to building a pipeline that transforms the USDA FoodData Central dataset into a unified, consolidated format, ready for AI and machine learning applications.
ai
data pipeline
usda
rag
project
nlp
Author
Jack Tol
Published
November 8, 2024
Important Note!
The code for the different functions of the data pipeline and USDA Food Assistant is explained in detail. The code blocks in these sections are hidden by default but can be viewed by expanding the “Show Code” folding subsection at the top of each such section.
Note
The USDA Food Data Pipeline, Utility Files, & USDA Food Data Assistant code can be found on the main project GitHub, linked here. The cleaned USDA Branded Food Dataset, created by the pipeline covered in this post, is available on HuggingFace Datasets, linked here. Additionally, the USDA Food Assistant can be accessed on HuggingFace Spaces, linked here.
Section 1 | Introduction
In data science and machine learning, creating a clean, well-structured dataset is often the most critical step in a successful project. Recently, I discovered the U.S. Department of Agriculture (USDA) FoodData Central (FDC) dataset, a comprehensive and publicly accessible resource that provides exhaustive information on foods available on U.S. consumer shelves. This dataset spans 34 separate CSV files and includes details on ingredients, branded food items, and nutrient information, making it a rich resource for food data analysis. However, the data is spread across multiple files, each covering different aspects, which poses a challenge for direct use in machine learning models.
For instance, key files include:
branded_food.csv - Contains product-specific information, including brand details, serving sizes, and ingredients.
"fdc_id","brand_owner","brand_name","subbrand_name","gtin_upc","ingredients","not_a_significant_source_of","serving_size","serving_size_unit","household_serving_fulltext","branded_food_category","data_source","package_weight","modified_date","available_date","market_country","discontinued_date","preparation_state_code","trade_channel","short_description""750930","Kellogg Company US","","","00038000127441","Whole grain wheat, rice, sugar, milled corn, wheat bran, contains 2% or less of salt, malt flavor, BHT for freshness.Vitamins and Minerals: Vitamin C (ascorbic acid), niacinamide, reduced iron, vitamin B6 (pyridoxine hydrochloride), vitamin B2 (riboflavin), vitamin B1 (thiamin hydrochloride), vitamin A palmitate, folic acid, vitamin B12, vitamin D3.","","56.0","g","1 Pouch","Cereals Products - Ready to Eat (Shelf Stable)","GDSN","","2017-11-07","2020-02-27","United States","","","",""
food.csv - Provides descriptive names and categories for food items.
"fdc_id","data_type","description","food_category_id","publication_date""750930","branded_food","Kellogg's Frosted Flakes Cereal Multi Grain 2oz","Cereals Products - Ready to Eat (Shelf Stable)","2020-02-27"
food_nutrient.csv - Lists nutrient data for each food item.
To make this dataset suitable for analysis, I began by consolidating these files. I filtered branded_food.csv to retain only the most recent records for each food item, then used food.csv to retrieve clear, descriptive names standardized under the FOOD_NAME label. By cross-referencing these with food_nutrient.csv, I ensured only relevant, up-to-date data was included for each item, resulting in a refined, comprehensive dataset.
To achieve this, I developed a pipeline that automates the steps of downloading, cleaning, filtering, merging, and normalizing the data. In this blog post, I’ll guide you through the code and logic behind this pipeline, showing how I transformed the USDA FoodData Central data into a unified, machine-learning-ready dataset.
The complete dataset, containing over 456,000 rows that detail the identification, serving size, ingredients, and nutrient profiles of branded food items available on U.S. consumer shelves, is available on HuggingFace, linked here.
An example row from the finalized dataset is shown below:
"FOOD_RECORD_ID","FOOD_ID","FOOD_NAME","FOOD_SERVING_SIZE","FOOD_INGREDIENTS","ACETIC ACID (MG)","ALANINE (G)","ALCOHOL, ETHYL (G)","ARGININE (G)","ASH (G)","ASPARTIC ACID (G)","BETA-GLUCANS (G)","BIOTIN (UG)","CAFFEINE (MG)","CALCIUM, CA (MG)","CARBOHYDRATE, BY DIFFERENCE (G)","CARBOHYDRATE, OTHER (G)","CAROTENE, BETA (UG)","CHLORINE, CL (MG)","CHOLESTEROL (MG)","CHOLINE, FROM PHOSPHOTIDYL CHOLINE (MG)","CHOLINE, TOTAL (MG)","CHROMIUM, CR (UG)","COPPER, CU (MG)","CYSTEINE (G)","CYSTINE (G)","ENERGY (KCAL)","ENERGY (KJ)","EPIGALLOCATECHIN-3-GALLATE (MG)","FATTY ACIDS, TOTAL MONOUNSATURATED (G)","FATTY ACIDS, TOTAL POLYUNSATURATED (G)","FATTY ACIDS, TOTAL SATURATED (G)","FATTY ACIDS, TOTAL TRANS (G)","FIBER, INSOLUBLE (G)","FIBER, SOLUBLE (G)","FIBER, TOTAL DIETARY (G)","FLUORIDE, F (UG)","FOLATE, DFE (UG)","FOLATE, FOOD (UG)","FOLATE, TOTAL (UG)","FOLIC ACID (UG)","FRUCTOSE (G)","GLUCOSE (G)","GLUTAMIC ACID (G)","GLUTAMINE (G)","GLYCINE (G)","HISTIDINE (G)","INOSITOL (MG)","INULIN (G)","IODINE, I (UG)","IRON, FE (MG)","ISOLEUCINE (G)","LACTIC ACID (MG)","LACTOSE (G)","LEUCINE (G)","LIGNIN (G)","LUTEIN + ZEAXANTHIN (UG)","LYSINE (G)","MAGNESIUM, MG (MG)","MANGANESE, MN (MG)","METHIONINE (G)","MOLYBDENUM, MO (UG)","NIACIN (MG)","PANTOTHENIC ACID (MG)","PHENYLALANINE (G)","PHOSPHORUS, P (MG)","POTASSIUM, K (MG)","PROLINE (G)","PROTEIN (G)","PUFA 18:2 (G)","PUFA 18:2 N-6 C,C (G)","PUFA 18:3 N-3 C,C,C (ALA) (G)","RETINOL (UG)","RIBOFLAVIN (MG)","RIBOSE (G)","SALT, NACL (MG)","SELENIUM, SE (UG)","SERINE (G)","SFA 10:0 (G)","SFA 12:0 (G)","SFA 8:0 (G)","SODIUM, NA (MG)","SORBITOL (G)","STARCH (G)","SUGARS, ADDED (G)","SUGARS, INTRINSIC (G)","SUGARS, TOTAL (G)","TAURINE (G)","THIAMIN (MG)","THREONINE (G)","TOTAL FAT (NLEA) (G)","TOTAL LIPID (FAT) (G)","TOTAL SUGAR ALCOHOLS (G)","TOTAL SUGARS (G)","TRYPTOPHAN (G)","TYROSINE (G)","VALINE (G)","VITAMIN A (UG)","VITAMIN A, IU (IU)","VITAMIN A, RAE (UG)","VITAMIN B-12 (UG)","VITAMIN B-6 (MG)","VITAMIN C, TOTAL ASCORBIC ACID (MG)","VITAMIN D (D2 + D3) (UG)","VITAMIN D (D2 + D3), INTERNATIONAL UNITS (IU)","VITAMIN D2 (ERGOCALCIFEROL) (UG)","VITAMIN D3 (CHOLECALCIFEROL) (UG)","VITAMIN E (ALPHA-TOCOPHEROL) (MG)","VITAMIN E (LABEL ENTRY PRIMARILY) (IU)","VITAMIN E (MG)","VITAMIN E (MG_ATE)","VITAMIN K (PHYLLOQUINONE) (UG)","WATER (G)","XYLITOL (G)","ZINC, ZN (MG)""750930","00038000127441","KELLOGG'S FROSTED FLAKES CEREAL MULTI GRAIN 2OZ","56.0 G","WHOLE GRAIN WHEAT, RICE, SUGAR, MILLED CORN, WHEAT BRAN, CONTAINS 2% OR LESS OF SALT, MALT FLAVOR, BHT FOR FRESHNESS.VITAMINS AND MINERALS: VITAMIN C (ASCORBIC ACID), NIACINAMIDE, REDUCED IRON, VITAMIN B6 (PYRIDOXINE HYDROCHLORIDE), VITAMIN B2 (RIBOFLAVIN), VITAMIN B1 (THIAMIN HYDROCHLORIDE), VITAMIN A PALMITATE, FOLIC ACID, VITAMIN B12, VITAMIN D3.","","","","",1.4,"","","","",20.0,86.6,"","","",0.0,"","","","","","",360.0,"","",0.2,0.8,0.3,0.0,7.0,1.0,8.0,"","","",357.0,"","","","","","","","","","",16.1,"","","","","","","",85.0,"","","",17.9,"","",270.0,279.0,"",7.7,"","","","",1.52,"","","","","","","",597.0,"","","","","","",1.0,"","",1.3,"",24.4,"","","","",1786.0,"",5.4,1.79,54.0,"",143.0,"","","",0.0,"","","",3.0,"",2.3
Section 2 | Downloading & Extracting the Data We Need
In this section, we’ll cover the first steps of the pipeline: downloading the USDA FoodData Central dataset and preparing it for further processing. This involves automating the retrieval of the latest dataset, extracting relevant files from the downloaded zip archive, and organizing the files within our working directory. By the end of this section, we’ll have the essential raw data files extracted and ready for cleaning and transformation.
2.1 | Download USDA FoodData Central Data
Show Code
import os
import re
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def download_usda_food_data():
    options = Options()
    options.add_argument("--log-level=3")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get('https://fdc.nal.usda.gov/download-datasets.html')
    try:
        # Locate the "Full Download of All Data Types" zip link in the page source.
        match = re.search(
            r'Full Download of All Data Types.*?href="(/fdc-datasets/FoodData_Central_csv_.*?\.zip)"',
            driver.page_source, re.DOTALL
        )
        download_link = f"https://fdc.nal.usda.gov{match.group(1)}" if match else None
        if download_link:
            # Stream the zip file to disk in 8 KB chunks.
            response = requests.get(download_link, stream=True)
            filepath = os.path.join(os.getcwd(), download_link.split('/')[-1])
            with open(filepath, 'wb') as file:
                for chunk in response.iter_content(chunk_size=8192):
                    file.write(chunk)
            return filepath
        else:
            print("CSV link not found.")
            return None
    finally:
        driver.quit()
To begin the data processing pipeline, we need to download the latest USDA FoodData Central (FDC) dataset from their website. This dataset is provided as a zip file containing multiple CSV files. The download_usda_food_data function automates this initial step by navigating the USDA website, locating the download link for the zip file, and saving it to your working directory.
This function utilizes Selenium to open the USDA dataset download page in a browser and extract the specific link to the zip file. It searches the page’s HTML content with a regular expression to find the relevant download link for the dataset. Once the link is identified, the function streams the download using the requests library and saves the zip file locally.
After successfully downloading the file, the function returns the path to the zip file, which is essential for the subsequent steps in the pipeline. If the download link cannot be found, the function will output an error message and return None.
2.2 | Extract ZIP File
Show Code
import zipfile

def extract_zip(zip_path, extract_to):
    # Unpack every file in the archive into the target directory.
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
With the zip file downloaded, the next step is to extract its contents. The extract_zip function handles this by taking the path to the zip file and an extraction directory as input, unzipping the dataset into the specified folder.
This function uses Python’s zipfile library to extract the zip file’s contents, including all the CSV files in the FDC dataset. This extracted data will then be available for the pipeline’s data cleaning and processing stages.
2.3 | Move Target Files
After extracting the zip file, we need to isolate only the specific CSV files required for the pipeline. The move_target_files function facilitates this by identifying and moving only the necessary files (branded_food.csv, food.csv, food_nutrient.csv, and nutrient.csv) from the extracted folder to the working directory.
This function iterates through the list of target files, checks if each file exists in the specified source folder, and moves them to the working directory. This ensures that only relevant files are available for the next steps, keeping the workspace organized.
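Since the full implementation lives in the project GitHub, here is a minimal sketch of what this step might look like, assuming a simple shutil-based move; treat the exact signature as an assumption:

import os
import shutil

# A minimal sketch; the actual pipeline implementation may differ in detail.
def move_target_files(source_folder, destination_folder):
    target_files = ['branded_food.csv', 'food.csv', 'food_nutrient.csv', 'nutrient.csv']
    for filename in target_files:
        source_path = os.path.join(source_folder, filename)
        # Move the file only if it actually exists in the extracted folder.
        if os.path.isfile(source_path):
            shutil.move(source_path, os.path.join(destination_folder, filename))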
2.4 | Clean Up
Show Code
import os
import shutil

def cleanup(files_to_delete):
    for path in files_to_delete:
        # Delete plain files directly; remove directories with their contents.
        if os.path.isfile(path):
            os.remove(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)
After moving the required files to the working directory, the original zip file and leftover extracted folder can be removed to save space. The cleanup function handles this by accepting a list of paths to delete, ensuring no unnecessary files remain in the working environment.
This function iterates through each path in files_to_delete, checking if it is a file or a directory. If it’s a file, it deletes it using os.remove; if it’s a directory, it removes it and all its contents using shutil.rmtree. This step keeps the workspace tidy and minimizes storage use.
Section 3 | Data Cleaning & Preparation
In this section, we’ll dive into the functions responsible for transforming the raw data files into their cleaned versions. These functions are the core processing components of the pipeline, focusing on extracting and refining the most recent records for each food item, identified by the gtin_upc field. This field uniquely tracks items over time, allowing us to filter for the latest entries and ensure accuracy.
Each function in this section contributes to preparing a refined, usable dataset that is ready for analysis. We’ll explore how these functions work together to achieve a clean, structured dataset.
The clean_branded_food function identifies and retains only the most recent entries for each food item in branded_food.csv. Since each gtin_upc uniquely identifies a food item, the function sorts the data by gtin_upc and fdc_id (record ID) in descending order, keeping only the latest record for each gtin_upc.
After filtering for the latest records, the function renames key columns for consistency and readability: fdc_id is renamed to FOOD_RECORD_ID, gtin_upc to FOOD_ID, and ingredients to FOOD_INGREDIENTS. Additionally, the function formats the serving size data by combining serving_size and serving_size_unit into a single FOOD_SERVING_SIZE column, standardizing units and ensuring proper formatting. Text fields are stripped of extra spaces and converted to uppercase for consistency, and numeric serving sizes are rounded to two decimal places.
The output is a refined DataFrame with only the latest entries, ready for further processing in the pipeline.
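To make the logic concrete, here is a minimal pandas sketch of the filtering and renaming described above; the real clean_branded_food (on GitHub) handles additional formatting details, so the exact column handling here is an assumption:

import pandas as pd

def clean_branded_food(branded_food_df):
    # Sort so the newest record (highest fdc_id) comes first within each gtin_upc,
    # then keep only that first record per item.
    latest = (branded_food_df
              .sort_values(['gtin_upc', 'fdc_id'], ascending=[True, False])
              .drop_duplicates(subset='gtin_upc', keep='first'))
    latest = latest.rename(columns={'fdc_id': 'FOOD_RECORD_ID',
                                    'gtin_upc': 'FOOD_ID',
                                    'ingredients': 'FOOD_INGREDIENTS'})
    # Combine the rounded numeric serving size with its unit, e.g. "56.0 G".
    latest['FOOD_SERVING_SIZE'] = (latest['serving_size'].round(2).astype(str)
                                   + ' ' + latest['serving_size_unit'].str.strip().str.upper())
    latest['FOOD_INGREDIENTS'] = latest['FOOD_INGREDIENTS'].str.strip().str.upper()
    return latest[['FOOD_RECORD_ID', 'FOOD_ID', 'FOOD_SERVING_SIZE', 'FOOD_INGREDIENTS']]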
The clean_food function associates each FOOD_RECORD_ID with a descriptive name, which aids in identifying each food item in the final dataset. The description field in food.csv typically contains both the brand and item name, making it essential for clarity.
The function begins by renaming columns for consistency: fdc_id becomes FOOD_RECORD_ID, and description is renamed to FOOD_NAME. It then formats FOOD_NAME to uppercase for uniformity across entries. Finally, the function filters the rows to include only those records that are present in cleaned_branded_food_df, thereby ensuring that only the latest entries are retained.
The output is a streamlined DataFrame with FOOD_RECORD_ID and FOOD_NAME for each relevant food item, ready to be merged with other cleaned datasets in the pipeline.
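A minimal sketch of this step, under the same caveat that the actual implementation may differ in detail:

def clean_food(food_df, cleaned_branded_food_df):
    cleaned = food_df.rename(columns={'fdc_id': 'FOOD_RECORD_ID', 'description': 'FOOD_NAME'})
    # Uppercase the name for uniformity across entries.
    cleaned['FOOD_NAME'] = cleaned['FOOD_NAME'].str.upper()
    # Keep only records that survived the branded-food filtering step.
    mask = cleaned['FOOD_RECORD_ID'].isin(cleaned_branded_food_df['FOOD_RECORD_ID'])
    return cleaned.loc[mask, ['FOOD_RECORD_ID', 'FOOD_NAME']]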
The clean_nutrient function standardizes nutrient.csv, the file holding general nutrient information, by renaming columns and constructing a combined label for each nutrient. It renames the columns to NUTRIENT_ID, NUTRIENT_NAME, and NUTRIENT_UNIT for clarity, then creates a new column, FOOD_NUTRIENT_NAME, which combines the nutrient name and unit in uppercase, providing a clear, unified label for each nutrient in the dataset. This label makes it easy to identify each nutrient and its measurement unit.
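A minimal sketch of this step; the raw column names in nutrient.csv (id, name, unit_name) are an assumption based on the FDC file layout:

def clean_nutrient(nutrient_df):
    cleaned = nutrient_df.rename(columns={'id': 'NUTRIENT_ID',
                                          'name': 'NUTRIENT_NAME',
                                          'unit_name': 'NUTRIENT_UNIT'})
    # Build a combined label, e.g. name "Iron, Fe" + unit "MG" -> "IRON, FE (MG)".
    cleaned['FOOD_NUTRIENT_NAME'] = (cleaned['NUTRIENT_NAME'].str.upper()
                                     + ' (' + cleaned['NUTRIENT_UNIT'].str.upper() + ')')
    return cleaned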
The clean_food_nutrient function processes the food_nutrient.csv data, which includes nutrient details for each food record (fdc_id). First, it renames columns to FOOD_RECORD_ID, NUTRIENT_ID, and NUTRIENT_QUANTITY for consistency. Then, it filters the data, retaining only the records where FOOD_RECORD_IDs are present in cleaned_branded_food_df, ensuring the dataset includes only the latest food entries.
After filtering, the function groups the data by FOOD_RECORD_ID and NUTRIENT_ID, calculating the mean NUTRIENT_QUANTITY for cases with multiple entries. Finally, it pivots the dataset so each row represents the most up-to-date record for a unique food item (FOOD_RECORD_ID), with columns for each nutrient (NUTRIENT_ID). This structure provides a clear, consolidated view of nutrient quantities per food item.
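A minimal sketch of the filter, group, and pivot flow; the raw amount column name is an assumption based on the FDC file layout:

def clean_food_nutrient(food_nutrient_df, cleaned_branded_food_df):
    cleaned = food_nutrient_df.rename(columns={'fdc_id': 'FOOD_RECORD_ID',
                                               'nutrient_id': 'NUTRIENT_ID',
                                               'amount': 'NUTRIENT_QUANTITY'})
    # Keep nutrient rows only for the latest food records.
    cleaned = cleaned[cleaned['FOOD_RECORD_ID'].isin(cleaned_branded_food_df['FOOD_RECORD_ID'])]
    # Average duplicate measurements for the same (food, nutrient) pair.
    grouped = (cleaned.groupby(['FOOD_RECORD_ID', 'NUTRIENT_ID'])['NUTRIENT_QUANTITY']
               .mean().reset_index())
    # Pivot so each row is one food item with a column per nutrient.
    return grouped.pivot(index='FOOD_RECORD_ID', columns='NUTRIENT_ID',
                         values='NUTRIENT_QUANTITY').reset_index()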
Section 4 | Final Data Integration & Validation
In this section, we finalize the dataset by mapping nutrient names, merging all cleaned data, and applying thresholds to ensure data quality. This phase is essential for transforming raw, segmented data into a cohesive and reliable dataset ready for analysis.
The map_nutrient_names_to_nutrient_ids function simplifies the nutrient data by creating a mapping between nutrient IDs and their descriptive names, based on cleaned_nutrient_df. It then renames the columns in cleaned_food_nutrient_df with these descriptive names, making the nutrient data more interpretable in the final dataset.
This function creates a dictionary called nutrient_map that pairs each NUTRIENT_ID with its corresponding FOOD_NUTRIENT_NAME. It then applies this mapping to cleaned_food_nutrient_df by renaming the columns, resulting in a DataFrame where each nutrient column has a meaningful label (e.g., IRON (MG) instead of just 1089).
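A minimal sketch of the mapping step described above:

def map_nutrient_names_to_nutrient_ids(cleaned_food_nutrient_df, cleaned_nutrient_df):
    # Pair each NUTRIENT_ID with its descriptive label, e.g. 1089 -> "IRON, FE (MG)".
    nutrient_map = dict(zip(cleaned_nutrient_df['NUTRIENT_ID'],
                            cleaned_nutrient_df['FOOD_NUTRIENT_NAME']))
    return cleaned_food_nutrient_df.rename(columns=nutrient_map)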
4.2 | Merge Cleaned Data into Final DataFrame
Show Code
import pandas as pd

def merge_cleaned_data_into_final_df(cleaned_branded_food_df, cleaned_food_df,
                                     mapped_nutrient_names_to_nutrient_ids_df):
    # Inner joins on FOOD_RECORD_ID keep only food items present in all three DataFrames.
    merged_data = pd.merge(cleaned_branded_food_df, cleaned_food_df,
                           on='FOOD_RECORD_ID', how='inner')
    final_data = pd.merge(merged_data, mapped_nutrient_names_to_nutrient_ids_df,
                          on='FOOD_RECORD_ID', how='inner')
    # Core identification columns first, then nutrient columns in alphabetical order.
    core_columns = ['FOOD_RECORD_ID', 'FOOD_ID', 'FOOD_NAME', 'FOOD_SERVING_SIZE', 'FOOD_INGREDIENTS']
    ordered_columns = core_columns + sorted(col for col in final_data.columns
                                            if col not in core_columns)
    return final_data[ordered_columns]
The merge_cleaned_data_into_final_df function brings together all the cleaned data into one comprehensive DataFrame. It merges cleaned_branded_food_df, cleaned_food_df, and mapped_nutrient_names_to_nutrient_ids_df on FOOD_RECORD_ID, ensuring that each unique food item has all relevant data combined into a single row. Additionally, it orders the columns to prioritize core information like FOOD_NAME and FOOD_ID, followed by each nutrient in alphabetical order.
This function performs two main tasks:
Merging DataFrames: It performs an inner join on FOOD_RECORD_ID to combine the data from cleaned_branded_food_df, cleaned_food_df, and mapped_nutrient_names_to_nutrient_ids_df.
Column Ordering: It creates an ordered list of columns, placing primary fields like FOOD_RECORD_ID, FOOD_ID, FOOD_NAME, FOOD_SERVING_SIZE, and FOOD_INGREDIENTS first, followed by nutrient columns sorted alphabetically.
The result is a tidy, well-organized DataFrame where each row represents a unique food item, labeled by FOOD_ID and FOOD_NAME, among other bits of information, with each nutrient clearly identified. This structure makes the dataset highly informative and accessible for downstream analysis.
4.3 | Define Nutrient Thresholds
Show Code
thresholds = {
    'VITAMIN A, IU (IU)': 333333,
    'VITAMIN D (D2 + D3), INTERNATIONAL UNITS (IU)': 4000000,
    'VITAMIN E (LABEL ENTRY PRIMARILY) (IU)': 1493,
    'G': 100,
    'MG': 100000,
    'UG': 100000000,
    'ENERGY (KCAL)': 900,
    'ENERGY (KJ)': 3766,
}
We define nutrient thresholds to manage outliers and enhance data quality, followed by steps to remove invalid serving sizes. These thresholds are essential to preparing the dataset for reliable analysis by addressing inconsistencies and filtering out potentially misleading values.
The thresholds are specifically set for each nutrient per 100 grams (or milliliters) of the raw food item, as nutrient data is provided on this basis. For most nutrients, the threshold is a maximum allowable value for 100 grams. For nutrients measured in International Units (IU), we apply unique thresholds based on each nutrient’s 100 gram scale, since IU measurements vary widely in concentration across different nutrients. Additionally, for energy metrics, we set thresholds in both kilocalories (KCAL) and kilojoules (KJ) at the theoretical maximum amount of energy storable in 100 grams of food.
This dictionary of thresholds enables us to maintain data consistency by flagging any values that exceed these limits. For example, entries with vitamin D values above 4,000,000 IU or energy values exceeding 900 KCAL per 100 grams are considered outliers, helping to ensure data accuracy and reliability in the final dataset.
The apply_nutrient_thresholds function enforces the predefined nutrient thresholds across each nutrient column. For each nutrient, it identifies the appropriate threshold based on either the nutrient’s specific name or its unit (e.g., MG, IU). Since the thresholds are defined per 100 grams (or milliliters) of the raw food item, this function filters out any values that exceed these limits, replacing them with NaN to flag them as outliers.
This function iterates through each nutrient column, determining the threshold based on either the nutrient’s specific name or unit. For nutrients measured in International Units (IU), it applies nutrient-specific thresholds due to the unique scaling requirements of each nutrient type. Additionally, for energy values (in kilocalories and kilojoules), thresholds reflect the theoretical maximum energy storable in 100 grams of food.
By applying these thresholds, nutrient values that exceed their respective limits are flagged with NaN, signaling values beyond the acceptable range and maintaining data quality. Finally, all nutrient values are rounded to two decimal places to ensure consistency and readability in the dataset.
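A minimal sketch of the threshold logic; parsing the unit out of the trailing "(UNIT)" in each column name is my assumption about how the lookup works:

import numpy as np

def apply_nutrient_thresholds(final_df, thresholds, core_columns):
    nutrient_columns = [col for col in final_df.columns if col not in core_columns]
    for column in nutrient_columns:
        # Prefer a nutrient-specific threshold; otherwise fall back to the unit-level one.
        unit = column.rsplit('(', 1)[-1].rstrip(')')
        limit = thresholds.get(column, thresholds.get(unit))
        if limit is not None:
            # Flag values above the limit as outliers by replacing them with NaN.
            final_df[column] = final_df[column].where(final_df[column] <= limit, np.nan)
    # Round all nutrient values to two decimal places for consistency.
    final_df[nutrient_columns] = final_df[nutrient_columns].round(2)
    return final_df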
The remove_invalid_serving_sizes function ensures data consistency in the FOOD_SERVING_SIZE column by filtering out entries that contain inappropriate units like IU, which are not valid for serving size measurements. It also removes rows with missing values in the FOOD_SERVING_SIZE column.
This function performs two main tasks to clean the serving size data:
It removes rows where FOOD_SERVING_SIZE contains the unit IU, as this unit is specific to nutrient measurements rather than serving sizes.
It drops any rows with missing values in the FOOD_SERVING_SIZE column to ensure all entries have meaningful serving size data.
By addressing these inconsistencies, the function helps maintain the accuracy and interpretability of serving size data in the final dataset.
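A minimal sketch of this validation, assuming a simple pattern match on the unit suffix:

def remove_invalid_serving_sizes(final_df):
    # Drop rows whose serving size carries the nutrient-only unit "IU".
    final_df = final_df[~final_df['FOOD_SERVING_SIZE'].str.contains(r'\bIU\b', na=False)]
    # Drop rows with no serving size at all.
    return final_df.dropna(subset=['FOOD_SERVING_SIZE'])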
Section 5 | Running the Full Data Pipeline
In this final section, we execute the complete data processing pipeline. The execute_pipeline function orchestrates all previously defined steps, from downloading and cleaning the raw USDA data files to merging, applying thresholds, and exporting the final dataset. This function enables a fully automated transformation of raw data into a structured, machine-learning-ready dataset, ensuring that each step is consistently applied for reliable and reproducible results.
Here’s a walkthrough of how each function is integrated to achieve this, followed by a sketch of the overall orchestration:
Download the Data:
The pipeline starts by calling download_usda_food_data, which scrapes the USDA website to find and download the latest dataset zip file.
If a valid zip file is successfully downloaded, the pipeline proceeds to the next step.
Extract and Organize Files:
The extract_zip function extracts the contents of the downloaded zip file to the current working directory.
Next, move_target_files relocates the specific files needed, branded_food.csv, food.csv, nutrient.csv, and food_nutrient.csv, to the working directory for easy access.
The cleanup function then deletes the original zip file and any leftover extracted folders to keep the workspace organized.
Load Data into DataFrames:
After extraction, each target file is loaded into a separate DataFrame using pd.read_csv. These DataFrames, branded_food_df, food_df, nutrient_df, and food_nutrient_df, serve as the input data for subsequent cleaning functions.
Clean Each DataFrame:
clean_branded_food: Filters the branded_food.csv data to retain only the latest record for each unique food item (identified by gtin_upc). This DataFrame includes essential fields like FOOD_RECORD_ID, FOOD_ID, FOOD_INGREDIENTS, and FOOD_SERVING_SIZE.
clean_food: Matches each FOOD_RECORD_ID in food.csv with a descriptive FOOD_NAME field, filtering for only the latest records.
clean_nutrient: Prepares a mapping of NUTRIENT_ID to FOOD_NUTRIENT_NAME by renaming and formatting fields in nutrient.csv.
clean_food_nutrient: Filters food_nutrient.csv to retain nutrient data only for the latest records and pivots the data to make each row represent one food item, with columns for each nutrient.
Map Nutrient Names to IDs:
Using map_nutrient_names_to_nutrient_ids, the pipeline replaces NUTRIENT_ID columns in the nutrient data with descriptive names, making the dataset more readable.
Merge Cleaned Data:
merge_cleaned_data_into_final_df combines all cleaned DataFrames into a single, cohesive DataFrame. It performs inner joins based on FOOD_RECORD_ID, ensuring each row represents a unique food item with all relevant details in one place, from FOOD_NAME and FOOD_ID to each nutrient.
Apply Final Data Cleaning:
The pipeline applies several final transformations to ensure data quality:
apply_nutrient_thresholds: Filters out any nutrient values exceeding predefined thresholds, replacing them with NaN if they are considered outliers.
remove_invalid_serving_sizes: Removes rows with serving sizes that contain invalid units (e.g., IU) and any missing serving sizes.
The column FOOD_RECORD_ID is then converted to a string to ensure consistent data type formatting.
Save & Clean Up:
The cleaned and fully processed DataFrame is saved to usda_branded_food_data.csv, ensuring it’s properly formatted for future use.
Lastly, all intermediate files (branded_food.csv, food.csv, nutrient.csv, food_nutrient.csv) are removed from the working directory to keep the environment clean.
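Putting it all together, here is a minimal sketch of how this orchestration could be wired up using the functions above; the extracted folder name (derived from the zip filename) and some signatures are assumptions, so treat this as an outline rather than a verbatim listing of execute_pipeline:

import os
import pandas as pd

def execute_pipeline():
    # 1. Download the latest FDC zip archive.
    zip_path = download_usda_food_data()
    if zip_path is None:
        return
    # 2. Extract, isolate the four target CSVs, and remove the leftovers.
    extracted_folder = zip_path[:-len('.zip')]  # assumed to match the zip name
    extract_zip(zip_path, os.getcwd())
    move_target_files(extracted_folder, os.getcwd())
    cleanup([zip_path, extracted_folder])
    # 3. Load the raw CSVs into DataFrames.
    branded_food_df = pd.read_csv('branded_food.csv', low_memory=False)
    food_df = pd.read_csv('food.csv', low_memory=False)
    nutrient_df = pd.read_csv('nutrient.csv', low_memory=False)
    food_nutrient_df = pd.read_csv('food_nutrient.csv', low_memory=False)
    # 4. Clean each DataFrame.
    cleaned_branded_food_df = clean_branded_food(branded_food_df)
    cleaned_food_df = clean_food(food_df, cleaned_branded_food_df)
    cleaned_nutrient_df = clean_nutrient(nutrient_df)
    cleaned_food_nutrient_df = clean_food_nutrient(food_nutrient_df, cleaned_branded_food_df)
    # 5 & 6. Map nutrient IDs to names, then merge everything on FOOD_RECORD_ID.
    mapped_df = map_nutrient_names_to_nutrient_ids(cleaned_food_nutrient_df, cleaned_nutrient_df)
    final_df = merge_cleaned_data_into_final_df(cleaned_branded_food_df, cleaned_food_df, mapped_df)
    # 7. Final quality passes.
    core_columns = ['FOOD_RECORD_ID', 'FOOD_ID', 'FOOD_NAME', 'FOOD_SERVING_SIZE', 'FOOD_INGREDIENTS']
    final_df = apply_nutrient_thresholds(final_df, thresholds, core_columns)
    final_df = remove_invalid_serving_sizes(final_df)
    final_df['FOOD_RECORD_ID'] = final_df['FOOD_RECORD_ID'].astype(str)
    # 8. Save the final dataset and delete the intermediate CSVs.
    final_df.to_csv('usda_branded_food_data.csv', index=False)
    cleanup(['branded_food.csv', 'food.csv', 'nutrient.csv', 'food_nutrient.csv'])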
Section 6 | Reviewing the Final Output
The pipeline produces a single, consolidated CSV file named usda_branded_food_data.csv, containing up-to-date, well-organized information for each food item. Each row in this dataset represents a unique, standardized record with clear, descriptive columns for nutrients, serving sizes, and ingredients, making it ideal for analysis.
For example, here’s a final output row for a sample item:
"FOOD_RECORD_ID","FOOD_ID","FOOD_NAME","FOOD_SERVING_SIZE","FOOD_INGREDIENTS","ACETIC ACID (MG)","ALANINE (G)","ALCOHOL, ETHYL (G)","ARGININE (G)","ASH (G)","ASPARTIC ACID (G)","BETA-GLUCANS (G)","BIOTIN (UG)","CAFFEINE (MG)","CALCIUM, CA (MG)","CARBOHYDRATE, BY DIFFERENCE (G)","CARBOHYDRATE, OTHER (G)","CAROTENE, BETA (UG)","CHLORINE, CL (MG)","CHOLESTEROL (MG)","CHOLINE, FROM PHOSPHOTIDYL CHOLINE (MG)","CHOLINE, TOTAL (MG)","CHROMIUM, CR (UG)","COPPER, CU (MG)","CYSTEINE (G)","CYSTINE (G)","ENERGY (KCAL)","ENERGY (KJ)","EPIGALLOCATECHIN-3-GALLATE (MG)","FATTY ACIDS, TOTAL MONOUNSATURATED (G)","FATTY ACIDS, TOTAL POLYUNSATURATED (G)","FATTY ACIDS, TOTAL SATURATED (G)","FATTY ACIDS, TOTAL TRANS (G)","FIBER, INSOLUBLE (G)","FIBER, SOLUBLE (G)","FIBER, TOTAL DIETARY (G)","FLUORIDE, F (UG)","FOLATE, DFE (UG)","FOLATE, FOOD (UG)","FOLATE, TOTAL (UG)","FOLIC ACID (UG)","FRUCTOSE (G)","GLUCOSE (G)","GLUTAMIC ACID (G)","GLUTAMINE (G)","GLYCINE (G)","HISTIDINE (G)","INOSITOL (MG)","INULIN (G)","IODINE, I (UG)","IRON, FE (MG)","ISOLEUCINE (G)","LACTIC ACID (MG)","LACTOSE (G)","LEUCINE (G)","LIGNIN (G)","LUTEIN + ZEAXANTHIN (UG)","LYSINE (G)","MAGNESIUM, MG (MG)","MANGANESE, MN (MG)","METHIONINE (G)","MOLYBDENUM, MO (UG)","NIACIN (MG)","PANTOTHENIC ACID (MG)","PHENYLALANINE (G)","PHOSPHORUS, P (MG)","POTASSIUM, K (MG)","PROLINE (G)","PROTEIN (G)","PUFA 18:2 (G)","PUFA 18:2 N-6 C,C (G)","PUFA 18:3 N-3 C,C,C (ALA) (G)","RETINOL (UG)","RIBOFLAVIN (MG)","RIBOSE (G)","SALT, NACL (MG)","SELENIUM, SE (UG)","SERINE (G)","SFA 10:0 (G)","SFA 12:0 (G)","SFA 8:0 (G)","SODIUM, NA (MG)","SORBITOL (G)","STARCH (G)","SUGARS, ADDED (G)","SUGARS, INTRINSIC (G)","SUGARS, TOTAL (G)","TAURINE (G)","THIAMIN (MG)","THREONINE (G)","TOTAL FAT (NLEA) (G)","TOTAL LIPID (FAT) (G)","TOTAL SUGAR ALCOHOLS (G)","TOTAL SUGARS (G)","TRYPTOPHAN (G)","TYROSINE (G)","VALINE (G)","VITAMIN A (UG)","VITAMIN A, IU (IU)","VITAMIN A, RAE (UG)","VITAMIN B-12 (UG)","VITAMIN B-6 (MG)","VITAMIN C, TOTAL ASCORBIC ACID (MG)","VITAMIN D (D2 + D3) (UG)","VITAMIN D (D2 + D3), INTERNATIONAL UNITS (IU)","VITAMIN D2 (ERGOCALCIFEROL) (UG)","VITAMIN D3 (CHOLECALCIFEROL) (UG)","VITAMIN E (ALPHA-TOCOPHEROL) (MG)","VITAMIN E (LABEL ENTRY PRIMARILY) (IU)","VITAMIN E (MG)","VITAMIN E (MG_ATE)","VITAMIN K (PHYLLOQUINONE) (UG)","WATER (G)","XYLITOL (G)","ZINC, ZN (MG)""750930","00038000127441","KELLOGG'S FROSTED FLAKES CEREAL MULTI GRAIN 2OZ","56.0 G","WHOLE GRAIN WHEAT, RICE, SUGAR, MILLED CORN, WHEAT BRAN, CONTAINS 2% OR LESS OF SALT, MALT FLAVOR, BHT FOR FRESHNESS.VITAMINS AND MINERALS: VITAMIN C (ASCORBIC ACID), NIACINAMIDE, REDUCED IRON, VITAMIN B6 (PYRIDOXINE HYDROCHLORIDE), VITAMIN B2 (RIBOFLAVIN), VITAMIN B1 (THIAMIN HYDROCHLORIDE), VITAMIN A PALMITATE, FOLIC ACID, VITAMIN B12, VITAMIN D3.","","","","",1.4,"","","","",20.0,86.6,"","","",0.0,"","","","","","",360.0,"","",0.2,0.8,0.3,0.0,7.0,1.0,8.0,"","","",357.0,"","","","","","","","","","",16.1,"","","","","","","",85.0,"","","",17.9,"","",270.0,279.0,"",7.7,"","","","",1.52,"","","","","","","",597.0,"","","","","","",1.0,"","",1.3,"",24.4,"","","","",1786.0,"",5.4,1.79,54.0,"",143.0,"","","",0.0,"","","",3.0,"",2.3
When we parse this data, it’s transformed into a clean, structured format:
General Food Information:
FOOD_RECORD_ID: 750930
FOOD_ID: 00038000127441
FOOD_NAME: KELLOGG’S FROSTED FLAKES CEREAL MULTI GRAIN 2OZ
FOOD_SERVING_SIZE: 56.0 G
FOOD_INGREDIENTS: WHOLE GRAIN WHEAT, RICE, SUGAR, MILLED CORN, WHEAT BRAN, CONTAINS 2% OR LESS OF SALT, MALT FLAVOR, BHT FOR FRESHNESS. VITAMINS AND MINERALS: VITAMIN C (ASCORBIC ACID), NIACINAMIDE, REDUCED IRON, VITAMIN B6 (PYRIDOXINE HYDROCHLORIDE), VITAMIN B2 (RIBOFLAVIN), VITAMIN B1 (THIAMIN HYDROCHLORIDE), VITAMIN A PALMITATE, FOLIC ACID, VITAMIN B12, VITAMIN D3.
Nutrient Information:
All quantities are per 100 grams:
Ash (G): 1.4
Calcium, Ca (MG): 20.0
Carbohydrate, By Difference (G): 86.6
Cholesterol (MG): 0.0
Energy (KCAL): 360.0
Fatty Acids, Total Monounsaturated (G): 0.2
Fatty Acids, Total Polyunsaturated (G): 0.8
Fatty Acids, Total Saturated (G): 0.3
Fiber, Insoluble (G): 7.0
Fiber, Soluble (G): 1.0
Fiber, Total Dietary (G): 8.0
Folate, DFE (UG): 357.0
Magnesium, Mg (MG): 85.0
Niacin (MG): 17.9
Phosphorus, P (MG): 270.0
Potassium, K (MG): 279.0
Protein (G): 7.7
Riboflavin (MG): 1.52
Sodium, Na (MG): 597.0
Sugars, Total (G): 24.4
Thiamin (MG): 1.0
Total Fat (NLEA) (G): 1.3
Vitamin A (UG): 1786.0
Vitamin B-6 (MG): 1.79
Vitamin C, Total Ascorbic Acid (MG): 54.0
Vitamin D (D2 + D3) (UG): 5.4
Zinc, Zn (MG): 2.3
This final dataset is now reliable and streamlined, with validated serving sizes, standardized nutrient data, and outliers removed. It’s structured for easy readability and ready for machine learning or other analytical applications.
Section 7 | USDA Food Assistant | How I Have Used the Dataset
To leverage the USDA food dataset I created, I developed an interactive application called USDA Food Assistant. This tool allows users to explore food data in a conversational format using retrieval-augmented generation (RAG). Hosted on HuggingFace Spaces, it combines structured data retrieval with natural language processing, enabling users to access detailed food information interactively.
7.1 | How the USDA Food Assistant Works
The USDA Food Assistant operates in two main phases:
Data Retrieval: When users input a food item name, the assistant performs a semantic search in the Pinecone Vector Store. Using the multilingual-e5-large embedding model, it locates relevant food data, retrieving the food name, serving size, ingredients, and nutrient details, which it then stores as context for the ongoing conversation.
Interactive Conversation: Once the data is loaded, users can ask detailed questions about the food item, such as:
“Does this food contain any allergens?”
“What are its vitamins and minerals?”
“How many calories and grams of sugar are in a 250-gram serving?”
The assistant’s language model provides contextually accurate answers, allowing users to delve into the food item’s data. This ensures a smooth experience for those looking to make informed choices about food content and nutrition.
7.2 | Preparing & Uploading the Data
Show Code
import time
import pandas as pd
from pinecone import Pinecone, ServerlessSpec

def upload_data_to_pinecone():
    pc = Pinecone()
    df = pd.read_csv('usda_branded_food_data.csv', low_memory=False, dtype={'FOOD_RECORD_ID': str})
    texts = []
    metadatas = []
    for _, row in df.iterrows():
        food_name = row['FOOD_NAME']
        texts.append(food_name)
        # Store every non-null column as metadata alongside the embedded food name.
        metadata = {column: row[column] for column in df.columns
                    if pd.notna(row[column]) and column != 'FOOD_NAME'}
        metadata["FOOD_NAME"] = food_name
        metadatas.append(metadata)
    batch_size = 90
    index_name = "branded-food-data"
    if not pc.has_index(index_name):
        pc.create_index(
            name=index_name,
            dimension=1024,
            metric="cosine",
            spec=ServerlessSpec(cloud='aws', region='us-east-1')
        )
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)
    index = pc.Index(index_name)

    def generate_embeddings_with_retry(pc, text_batch, max_retries=5):
        for attempt in range(max_retries):
            try:
                embeddings = pc.inference.embed(
                    model="multilingual-e5-large",
                    inputs=text_batch,
                    parameters={"input_type": "passage", "truncate": "END"}
                )
                return embeddings
            except Exception as e:
                print(f"Embedding generation failed on attempt {attempt + 1}: {e}")
                if attempt < max_retries - 1:
                    # Back off exponentially before retrying.
                    time.sleep(2 ** attempt)
                else:
                    print("Max retries for embedding generation reached. Skipping this batch.")
                    return None

    def upload_documents_with_retry(index, texts, metadatas, batch_size, max_retries=15):
        for i in range(0, len(texts), batch_size):
            text_batch = texts[i:i + batch_size]
            metadata_batch = metadatas[i:i + batch_size]
            embeddings = generate_embeddings_with_retry(pc, text_batch)
            if embeddings is None:
                continue
            records = []
            for j, embedding in enumerate(embeddings):
                records.append({
                    "id": str(i + j),
                    "values": embedding['values'],
                    "metadata": metadata_batch[j]
                })
            for attempt in range(max_retries):
                try:
                    index.upsert(vectors=records)
                    print(f"Uploaded {len(records)} records to Pinecone index '{index_name}' on attempt {attempt + 1}")
                    break
                except Exception as e:
                    print(f"Attempt {attempt + 1} failed with error: {e}")
                    if attempt < max_retries - 1:
                        time.sleep(2 ** attempt)
                    else:
                        print("Max retries reached. Some records may not have been uploaded.")

    # Kick off the batched upload of all records.
    upload_documents_with_retry(index, texts, metadatas, batch_size)
The first step involved structuring and uploading the food data by reading usda_branded_food_data.csv, which the pipeline outputs, generating semantic embeddings for each food item, and uploading them to Pinecone.
Data Preparation: The CSV file is read, with FOOD_NAME used for embeddings, and other columns stored as metadata.
Embedding Generation: Using the multilingual-e5-large model, embeddings are created to capture each food name’s meaning.
Index Creation and Upload: If the "branded-food-data" index doesn’t exist, it’s created with cosine similarity as the metric. Embeddings and metadata are then uploaded in batches.
7.3 | Retrieving the Data
Once the data is uploaded, we need a way to retrieve it from the Pinecone index so it can be passed to the LLM as context. The process is as follows:
Similarity Search: For a query (e.g., “oreo cookies”), we generate an embedding using the same multilingual-e5-large model and perform a similarity search across the Pinecone index with respect to the input query.
Formatting Results: The top k matches (I chose k = 10) are retrieved with their metadata and formatted for easy readability. A sketch of the retrieval step is shown below; the full code can be found in my GitHub, linked here.
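Here is a minimal sketch of that retrieval step, under the hypothetical name retrieve_food_data; it mirrors the upload code’s use of the Pinecone inference API, and the result formatting is only an approximation of the output shown below:

from pinecone import Pinecone

def retrieve_food_data(query, top_k=10):
    pc = Pinecone()
    index = pc.Index("branded-food-data")
    # Embed the query with the same model used at upload time.
    embedding = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=[query],
        parameters={"input_type": "query", "truncate": "END"}
    )[0]['values']
    results = index.query(vector=embedding, top_k=top_k, include_metadata=True)
    # Turn each match's metadata into a readable block of "KEY: value" lines.
    formatted = []
    for rank, match in enumerate(results['matches'], start=1):
        metadata = dict(match['metadata'])
        lines = [f"{rank}. {metadata.pop('FOOD_NAME', 'UNKNOWN')}", "Food Details:"]
        lines += [f"{key}: {value}" for key, value in metadata.items()]
        formatted.append('\n'.join(lines))
    return '\n\n'.join(formatted)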
For instance, a query result for “oreo cookies” looks like this:
1. OREO COOKIES
Food Details:
CARBOHYDRATE, BY DIFFERENCE (G): 73.0
FOOD_ID: 07622300688455
FOOD_RECORD_ID: 770088
FIBER, TOTAL DIETARY (G): 2.5
PROTEIN (G): 5.0
TOTAL SUGARS (G): 38.0
TOTAL LIPID (FAT) (G): 20.0
FATTY ACIDS, TOTAL SATURATED (G): 9.8
FOOD_INGREDIENTS: INGREDIENTS: WHEAT FLOUR, SUGAR, VEGETABLE OILS (PALM, PALM KERNEL), FAT REDUCED COCOA POWDER 4,6%, WHEAT STARCH, GLUCOSE-FRUCTOSE SYRUP, RAISING AGENTS (POTASSIUM HYDROGEN CARBONATE, AMMONIUM HYDROGEN CARBONATE, SODIUM HYDROGEN CARBONATE), SALT, EMULSIFIERS (SOYA LECITHIN, SUNFLOWER LECITHIN), FLAVOURING. MAY CONTAIN MILK.
SODIUM, NA (MG): 400.0
FOOD_SERVING_SIZE: 100.0 G
2. OREO COOKIE PIECES
Food Details:
SODIUM, NA (MG): 550.0
CARBOHYDRATE, BY DIFFERENCE (G): 71.43
FOOD_ID: 018894250975
FOOD_RECORD_ID: 2105489
FIBER, TOTAL DIETARY (G): 0.0
PROTEIN (G): 7.14
TOTAL SUGARS (G): 42.86
TOTAL LIPID (FAT) (G): 21.43
FOOD_INGREDIENTS: UNBLEACHED ENRICHED FLOUR,(WHEAT FLOUR, NIACIN, REDUCED IRON, THIAMINE MONONITRATE, (VIT. B1) RIBOFLAVIN, (VIT. B2), FOLIC ACID), SUGAR, PALM &/OR CANOLA OIL, COCOA (PROCESSED WITH ALKALI), HIGH FRUCTOSE CORN SYRUP, LEAVENING (BAKING SODA &/OR CALCIUM PHOSPHATE), SALT, SOY LECITHIN, CHOCOLATE, ARTIFICIAL FLAVOR.
ENERGY (KCAL): 500.0
FOOD_SERVING_SIZE: 14.0 G
3. OREO COOKIE PIECES
Food Details:
FOOD_ID: 604485906514
FOOD_RECORD_ID: 2127841
VITAMIN D (D2 + D3), INTERNATIONAL UNITS (IU): 0.0
TOTAL SUGARS (G): 38.89
TOTAL LIPID (FAT) (G): 22.22
FATTY ACIDS, TOTAL TRANS (G): 0.0
FATTY ACIDS, TOTAL SATURATED (G): 5.56
POTASSIUM, K (MG): 0.0
ENERGY (KCAL): 500.0
FOOD_SERVING_SIZE: 18.0 G
CALCIUM, CA (MG): 0.0
CARBOHYDRATE, BY DIFFERENCE (G): 77.78
FIBER, TOTAL DIETARY (G): 5.6
PROTEIN (G): 5.56
IRON, FE (MG): 0.0
CHOLESTEROL (MG): 0.0
FOOD_INGREDIENTS: UNBLEACHED ENRICHED FLOUR (WHEAT FLOUR, NIACIN, REDUCED IRON, THIAMINE MONONITRATE, (VIT. B1) RIBOFLAVIN, (VIT. B2), FOLIC ACID), SUGAR, PALM &/OR CANOLA OIL, COCOA (PROCESSED WITH ALKALI), HIGH FRUCTOSE CORN SYRUP, LEAVENING (BAKING SODA &/OR CALCIUM PHOSPHATE), SALT, SOY LECITHIN, CHOCOLATE, ARTIFICIAL FLAVOR.
SUGARS, ADDED (G): 38.9
SODIUM, NA (MG): 483.0
7.4 | Building the Interactive USDA Food Assistant
The USDA Food Assistant, available on HuggingFace Spaces, combines retrieval and natural language processing for a responsive experience. It uses Chainlit for the frontend, Pinecone for semantic search, and GPT-4o for language generation.
Here’s how the application integrates these components:
User Query Handling: When a user submits a food item or specific query, Chainlit handles the interaction.
Semantic Search: The application passes the query through the embedding model and searches the branded-food-data Pinecone index for the most similar food items.
Data Retrieval and Formatting: Food data (food name, identification numbers, ingredients, nutrients, serving sizes) is retrieved from Pinecone, specifically from the metadata of the top results, and dynamically inserted into a prompt template along with the user’s original query, which identifies the food item they are referring to. This prompt template is then sent to a GPT-4o inference model with clear instructions on its role and how to use the data it’s provided. A simplified sketch of this loop follows the list below.
Dynamic Responses: Upon gathering the context, the model offers an overview and suggests a few sample questions like:
“Does this food contain any allergens?”
“Provide a nutrient breakdown, including vitamins and minerals.”
“How many calories are in a 250-gram serving?”
Enhanced User Experience: After reading the overview, the user can engage in an interactive back-and-forth conversation about the food data which has been retrieved.
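As an illustration of how these pieces could connect, here is a heavily simplified sketch of the query-to-response loop using Chainlit and the OpenAI client; the actual assistant’s prompt template, session state, and conversation handling are more involved, and retrieve_food_data is the hypothetical helper sketched in Section 7.3:

import chainlit as cl
from openai import OpenAI

client = OpenAI()

@cl.on_message
async def on_message(message: cl.Message):
    # Retrieve matching food records to use as grounding context.
    context = retrieve_food_data(message.content)
    prompt = (
        "You are the USDA Food Assistant. Answer using only the food data below.\n\n"
        f"Food data:\n{context}\n\n"
        f"User message: {message.content}"
    )
    # Generate a grounded answer with GPT-4o and send it back to the UI.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    await cl.Message(content=response.choices[0].message.content).send()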
With this application, users gain access to USDA food data in a way that was previously impractical, allowing them to make more informed decisions about the food they are eating, or planning on eating, based on its ingredients, nutrient profile, and serving size details.
You can explore the assistant on HuggingFace Spaces. Additionally, the code for the metadata pipeline, utility files, and the assistant is available on my GitHub, linked here.
Section 8 | Conclusion
This project demonstrates the process of transforming raw USDA FoodData Central data into a structured, analysis-ready dataset. By building a robust pipeline, we automated the steps of downloading, cleaning, filtering, merging, and validating the data, resulting in a cohesive dataset suitable for machine learning and data analysis.
Each stage in the pipeline enhances data quality and consistency: outdated records are filtered out, nutrient names are mapped for readability, and threshold checks are applied to remove outliers. The final output offers a comprehensive view of each food item, complete with clear identifiers, descriptive names, and standardized nutrient values, making it an ideal resource for data-driven applications in areas such as nutrition science and consumer research.
Finally, I demonstrate a practical application of the dataset in my USDA Food Assistant application. By uploading this data to my Pinecone index, I can retrieve detailed information by extracting metadata associated with vectors representing food names. This enables a natural language query containing a branded food item to retrieve all relevant data for that food. I then integrate a GPT-4o model to support dynamic, interactive responses based on the retrieved food data.
I look forward to seeing how people in the Machine Learning & Data Science community utilize this dataset.