How Ingredient Recognition Actually Works (And the Three Places It Breaks)
When we tell people CookSnap identifies ingredients from a photo of the fridge, the first question is “how does that work,” and the second question, asked more carefully, is “does it actually work.” The honest answer to the second is “most of the time, with specific known failures we’re working on.” The honest answer to the first takes a couple of paragraphs. Here it is.
The pipeline, in plain English
When you snap a fridge photo, the image goes through three sequential stages. None of them are AI magic; all of them are well-understood computer-vision components stacked in a way that’s specific to food.
- Object detection. A vision model scans the image and returns bounding boxes around things it thinks are discrete objects. Most modern systems use a variant of YOLO, a Vision Transformer, or a Gemini/Claude/GPT multi-modal API. Output at this stage is “there’s a thing at pixels (240, 180) to (390, 320).”
- Classification. Each bounding box is cropped out and passed to a classifier that returns “this is most likely a tomato (confidence 0.91), or possibly a red bell pepper (0.06).” This is the step that turns shapes into ingredient names.
- Canonicalization. The raw classifier output (“Roma tomato”) is mapped to a canonical ingredient (“tomato”) in our taxonomy, with the variety stored as metadata. This is the step that lets the downstream matcher work, because the matcher only knows the canonical taxonomy.
The CookSnap iOS app does the first two stages on-device using the iPhone’s Neural Engine, and the third stage in a thin cloud call that receives only the classifier’s labels. The on-device path matters for privacy (the fridge photo never leaves your phone) and for speed (sub-second identification of 15+ ingredients on an iPhone 12 or newer).
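To make the canonicalization stage concrete, here is a minimal Python sketch. The taxonomy table, the `Detection` and `Ingredient` types, and the `canonicalize` function are all illustrative assumptions, not CookSnap’s actual schema. The only point is that the variety survives as metadata while the matcher sees the canonical name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

# Hypothetical taxonomy: raw classifier label -> (canonical ingredient, variety).
TAXONOMY = {
    "roma tomato": ("tomato", "Roma"),
    "cherry tomato": ("tomato", "cherry"),
    "tomato": ("tomato", None),
    "red bell pepper": ("bell pepper", "red"),
}


@dataclass(frozen=True)
class Detection:
    """Output of stages one and two: a box plus the classifier's best label."""
    box: Box
    label: str
    confidence: float


@dataclass(frozen=True)
class Ingredient:
    """Output of stage three: what the matcher is allowed to see."""
    canonical: str
    variety: Optional[str]
    box: Box
    confidence: float


def canonicalize(det: Detection) -> Optional[Ingredient]:
    """Map a raw label to the canonical taxonomy, or None if it is unknown."""
    entry = TAXONOMY.get(det.label.strip().lower())
    if entry is None:
        return None
    canonical, variety = entry
    return Ingredient(canonical, variety, det.box, det.confidence)
```

In this sketch an unknown label returns `None` rather than a best guess, so the caller decides what to do with it instead of passing an off-taxonomy name to the matcher.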
The three places this breaks
We are going to be honest about the failure modes because the category as a whole is dishonest about them.
- Things in opaque packaging. If your milk is in a carton with a brand label, the vision model sees a carton, not milk. We work around this with a label-text OCR pass for known brands, but if your fridge looks like a supermarket shelf, expect maybe 50% recall on packaged goods. We tell you which boxes the model couldn’t open, so to speak.
- Things partially hidden. Half a bunch of spinach behind a yogurt tub gets missed. Vegetables stored loose in the crisper drawer get missed because the model can’t see through the drawer. The current workaround is “take a second photo of the drawer”: ugly UX, but honest. We have a longer-term plan for multi-photo merge.
- Ambiguous things. The model can usually tell a tomato from a bell pepper. It cannot always tell a Granny Smith from a Honeycrisp. It cannot tell whole milk from skim milk without label OCR. For categories where the variety matters to the recipe matcher, we surface a confirmation step instead of guessing.
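The confirmation step for ambiguous items can be sketched as a small decision rule over the classifier’s top two candidates. This is an assumption-heavy illustration: the `Candidate` type, the `VARIETY_SENSITIVE` set, and the `0.6` / `0.2` thresholds are hypothetical, not values from the shipping app.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    label: str         # raw classifier label, e.g. "Granny Smith apple"
    canonical: str     # canonical ingredient the label maps to, e.g. "apple"
    confidence: float  # classifier score in [0, 1]


# Hypothetical: categories where the variety changes which recipes fit.
VARIETY_SENSITIVE = frozenset({"apple", "milk"})


def needs_confirmation(candidates: List[Candidate],
                       min_confidence: float = 0.6,
                       min_margin: float = 0.2) -> bool:
    """Return True when the user should confirm instead of the app guessing."""
    if not candidates:
        raise ValueError("need at least one candidate")
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    top = ranked[0]
    if len(ranked) == 1:
        return top.confidence < min_confidence
    runner_up = ranked[1]
    close_call = (top.confidence - runner_up.confidence) < min_margin
    if top.canonical == runner_up.canonical:
        # Same ingredient either way; only ask if the variety matters.
        return close_call and top.canonical in VARIETY_SENSITIVE
    # Different ingredients: ask on a close call or a weak top score.
    return close_call or top.confidence < min_confidence
```

With these assumed thresholds, the tomato (0.91) versus red bell pepper (0.06) case passes straight through, while a Granny Smith versus Honeycrisp near-tie triggers a confirmation because apple is listed as variety-sensitive.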
Why we don’t use a single model
A reasonable question: why not just feed the whole fridge photo to GPT-4o or Claude or Gemini and ask “what ingredients do you see?” You absolutely can, and a lot of newer cooking apps do exactly this. The output is usually impressive in a demo.
We tried this for six months. The failure mode was subtle: the model would confidently identify ingredients that weren’t in the photo because they were “the kind of thing you’d expect to see in a fridge.” Half-and-half. Mustard. Hot sauce. The model is over-confident on the “normal kitchen” prior even when those items aren’t visible.
A pipeline that does explicit bounding boxes and per-box classification forces the model to commit to spatial claims. It can’t report an ingredient without pointing at where the ingredient is, so an item that exists only in the “normal kitchen” prior has nothing to attach to. That’s the safety property the pipeline buys you, and it’s why we kept it even after the multi-modal APIs got good.
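One way to picture that property is as a grounding check: a claimed ingredient counts only if at least one valid box backs it. In the pipeline this holds by construction, since names come from boxes. The sketch below applies the same rule to a free-form list, and the function names and the image dimensions are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def is_valid_box(box: Box, width: int, height: int) -> bool:
    """A box must have positive area and lie inside the image."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def split_grounded(claimed: List[str],
                   boxes: Dict[str, List[Box]],
                   width: int,
                   height: int) -> Tuple[List[str], List[str]]:
    """Separate claims backed by a valid box from claims with nothing to point at."""
    grounded: List[str] = []
    ungrounded: List[str] = []
    for name in claimed:
        has_box = any(is_valid_box(b, width, height) for b in boxes.get(name, []))
        (grounded if has_box else ungrounded).append(name)
    return grounded, ungrounded
```

A box alone does not make the label inside it correct; the check only rules out items that were never located in the image at all.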
What we’re still working on
Three open problems, named:
- Quantity estimation. We can tell you that you have eggs. We can’t reliably tell you that you have three eggs versus a dozen. The matcher currently assumes quantities are unconstrained, which is mostly fine but occasionally surfaces a recipe that needs more than you have.
- Cooked-vs-raw disambiguation. A photo of last night’s chicken in a Tupperware looks superficially like a photo of raw chicken. We can usually tell them apart by context (Tupperware vs. plastic packaging) but not always.
- Multi-day fridges. The model has no concept of “this jar of marinara has been open for three weeks.” We’re experimenting with passive freshness inference but it’s nowhere near shipping.
Why this matters for recipe matching
Every failure above propagates into the recipe match. If the model misses your spinach, the matcher returns a recipe that doesn’t use spinach. If the model misidentifies your Pecorino as Parmesan, the recipe technically still works but the dish is wrong.
This is why ingredient recognition has to be honest about its failure modes. A recipe finder that pretends its vision pipeline is perfect will quietly send users to dishes their fridge can’t actually produce. We’d rather surface a confirmation screen and look slightly less magical than ship a confidently wrong match.
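The propagation described above is easy to see with a toy matcher. This sketch assumes a simple rule (a recipe matches when every ingredient it needs was recognized) and a made-up recipe table; the real matcher is not described here and is presumably more involved. Quantities are ignored, in line with the unconstrained-quantity assumption mentioned earlier.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical recipe table: recipe name -> canonical ingredients it needs.
RECIPES: Dict[str, FrozenSet[str]] = {
    "spinach omelette": frozenset({"egg", "spinach"}),
    "plain omelette": frozenset({"egg"}),
    "tomato salad": frozenset({"tomato"}),
}


def match_recipes(recognized: Set[str],
                  recipes: Dict[str, FrozenSet[str]] = RECIPES) -> List[str]:
    """Return recipes whose required ingredients were all recognized."""
    return sorted(name for name, needed in recipes.items() if needed <= recognized)


# What is actually in the fridge vs. what the vision stages reported.
in_fridge = {"egg", "spinach", "tomato"}
seen_by_model = {"egg", "tomato"}  # spinach was hidden behind the yogurt tub

full_match = match_recipes(in_fridge)
degraded_match = match_recipes(seen_by_model)
```

Here `degraded_match` silently loses the spinach omelette: nothing in the output signals that a recipe went missing, which is the quiet failure a confirmation screen is meant to catch.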
If you want to try the pipeline, the iOS app handles fridge photos directly; the free web tool skips the vision step and takes typed ingredients, which is a cleaner test of the matcher itself.