We Tested 5 AI Cooking Apps on 50 Real Fridge Contents. Here's What They Got Wrong.

For six weeks, we ran a structured comparison of the five most prominent “input ingredients, get recipes” apps: SuperCook, DishGen, ChefGPT, FoodsGPT, and our own CookSnap. We used 50 real ingredient lists collected from CookSnap users (with permission), each averaging 6.2 ingredients. We logged every result against a checklist of failure modes.
We are obviously biased, so the full methodology and per-prompt results are available to anyone who wants to audit them. Here are the patterns.
The failure-mode checklist
For each result we marked “yes” or “no” on the following:
- Phantom ingredient. Did the recipe use an ingredient the user did not provide and that the app did not flag as needed-to-buy?
- Missing instruction. Did a recipe step reference something not produced by an earlier step?
- Quantity hallucination. Did the recipe ask for a quantity that exceeded what a typical home cook would have on hand (“4 cups of saffron”)?
- Truncated output. Did the response end mid-sentence or mid-step?
- Recipe doesn’t exist. Did a Google search of the exact title plus three key ingredients fail to turn up any recipe matching the dish name and structure?
For SuperCook and CookSnap, the “recipe doesn’t exist” check is structurally impossible — both retrieve from libraries of real recipes — so we substituted “recipe link leads to a dead page or paywall.”
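The phantom-ingredient check is the easiest one to make mechanical, because it reduces to a set difference. The snippet below is a minimal sketch of that idea, not the script used in the test: the function name `phantom_ingredients` and the lowercase-and-strip normalization are illustrative assumptions, and real ingredient matching would also need synonym handling.

```python
def phantom_ingredients(recipe_ingredients, user_ingredients, flagged_to_buy=()):
    """Return recipe ingredients the user did not provide and the app did not flag.

    Hypothetical helper for illustration. Matching is naive: names are compared
    after lowercasing and trimming whitespace, with no synonym handling.
    """
    def normalize(name):
        return name.strip().lower()

    have = {normalize(i) for i in user_ingredients}
    flagged = {normalize(i) for i in flagged_to_buy}
    used = {normalize(i) for i in recipe_ingredients}
    # Anything used that is neither on hand nor flagged as needed-to-buy is a phantom.
    return sorted(used - have - flagged)


# Toy example, not test data: "cream" was neither provided nor flagged.
print(phantom_ingredients(["Eggs", "butter", "cream"], ["eggs", "Butter"]))
```

A non-empty result means the run would be marked “yes” for phantom ingredients under this simplified reading of the checklist.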
Headline numbers
Across 50 prompts × 5 apps = 250 runs:
- Phantom ingredients. DishGen: 38% of runs. ChefGPT: 31%. FoodsGPT: 22%. SuperCook: 4% (these were all cases where the linked blog recipe quietly required something not in its index). CookSnap: 0%.
- Missing instruction steps. DishGen: 14%. ChefGPT: 18%. FoodsGPT: 8%. SuperCook: 11% (linked-blog quality variance). CookSnap: 1% (one recipe needed an editorial fix, which we shipped).
- Quantity hallucinations. DishGen: 6%. ChefGPT: 4%. FoodsGPT: 8%. SuperCook: 2%. CookSnap: 0%.
- Truncated output. Generative apps only. DishGen: 4%. ChefGPT: 6%. FoodsGPT: 12% (the highest rate of mid-sentence terminations, often on long ingredient lists).
- Recipe doesn’t exist. Generative apps only. DishGen: 100% (by definition, every recipe is generated). ChefGPT: 100%. FoodsGPT: 100%. We mark this neutrally: generation is the design choice, and the “exists” check just confirms which side of the architecture each app is on.
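For anyone re-deriving percentages like the ones above from a log of yes/no marks, the tally is a simple count per app and failure mode. The following is a sketch under assumptions: the `(app, failure_mode, flagged)` tuple layout and the function name `failure_rates` are hypothetical, and the sample rows are toy values, not our results.

```python
from collections import defaultdict


def failure_rates(runs):
    """Compute percent of runs flagged "yes" per app and failure mode.

    `runs` is an iterable of (app, failure_mode, flagged) tuples, where
    `flagged` is True for a "yes" mark. This layout is an assumption made
    for illustration.
    """
    # counts[app][mode] = [number flagged, total runs checked]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for app, mode, flagged in runs:
        cell = counts[app][mode]
        cell[0] += int(flagged)
        cell[1] += 1
    return {
        app: {mode: round(100 * yes / total, 1) for mode, (yes, total) in modes.items()}
        for app, modes in counts.items()
    }


# Toy rows only, to show the shape of the input.
toy_runs = [
    ("AppA", "phantom_ingredient", True),
    ("AppA", "phantom_ingredient", False),
    ("AppB", "phantom_ingredient", False),
]
print(failure_rates(toy_runs))
```

Keeping the denominator per app and per mode matters here, because some checks (truncated output, for example) only apply to the generative apps.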
Where each app actually beat us
We want to be honest about the prompts where CookSnap was not the best answer.
- Extremely deep pantries (15+ ingredients). SuperCook returned more matches than we did on 9 of the 50 prompts. Our curated library is smaller, so very wide queries are where they win on breadth.
- Unusual single-ingredient queries. When the input was “just give me something with kabocha squash,” ChefGPT’s generated response read more like recipe inspiration than a thing to cook tonight, but if inspiration was the goal, it was useful.
- Highly specific dietary constraints. DishGen let users specify “FODMAP-friendly, no nightshades, high protein” in plain language and produced a coherent result. Our filters are structured (checkbox-style) and require knowing which standard tag to pick.
What we updated based on the test
Several results in the test surprised us, and we have shipped a change in response to each of them.
- We added a confidence indicator to results below 70% fit so users know to expect missing ingredients.
- We expanded the canonical ingredient taxonomy by 280 entries based on user inputs the matcher couldn’t resolve.
- We patched one recipe that had a missing step (it had been imported from a creator submission and our editorial pass had missed it).
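To illustrate the first change, here is a minimal sketch of a threshold check for the confidence indicator. The 70% cutoff comes from the list above; everything else is an assumption made for illustration, in particular the definition of “fit” as the share of a recipe’s ingredients the user already has, and the names `fit_score` and `needs_confidence_indicator`.

```python
FIT_THRESHOLD = 0.70  # cutoff stated above: results below 70% fit get the indicator


def fit_score(recipe_ingredients, user_ingredients):
    """Share of the recipe's ingredients that the user already has.

    ASSUMPTION: this definition of "fit" is illustrative only; a production
    scorer may weight ingredients or handle substitutions differently.
    """
    needed = {i.strip().lower() for i in recipe_ingredients}
    have = {i.strip().lower() for i in user_ingredients}
    if not needed:
        return 1.0  # an empty recipe needs nothing, so treat it as a full fit
    return len(needed & have) / len(needed)


def needs_confidence_indicator(recipe_ingredients, user_ingredients):
    """True when the result should warn the user to expect missing ingredients."""
    return fit_score(recipe_ingredients, user_ingredients) < FIT_THRESHOLD


# Toy example: the user has 1 of 2 ingredients, so the indicator is shown.
print(needs_confidence_indicator(["rice", "kabocha squash"], ["rice"]))
```

Under this definition, a recipe with four ingredients where the user has three scores 0.75 and is shown without the indicator.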
The honest takeaway
We are not the right tool for everyone. If you cook from a twenty-item pantry and want maximum breadth, SuperCook is a great answer. If you want a creative brainstorming partner who will hand you something to think about, generative apps are useful in that mode.
If you want a recipe that exists, that someone has cooked, that uses the things you have and tells you honestly about the things you don’t — that’s the corner CookSnap was built to own, and the test results show we own it.
Audit the data yourself
We’ll happily share the spreadsheet, the 50 prompts, and the raw outputs with any third party who wants to verify the results. Email the team and ask. Reproducibility is the part of “AI cooking comparison” that most write-ups skip.