
AI Can't Count Carbs—and That's the Whole Problem
A diabetes tech blogger ran 27,000 tests on the same meal. LLMs gave him 27,000 different answers. Time to stop pretending probabilistic models are deterministic tools.
I spent Monday afternoon watching a developer named Tim Street lose his mind.
He'd fed the same photograph of the same meal into the same AI model 27,000 times. He wanted one number: how many carbohydrates are on this plate? The model never gave him the same answer twice.
Not approximately the same. Not close enough for government work. Literally never the same number twice across 27,000 queries.
This is not a diabetes story. This is the entire AI deployment story in a single experiment.
Street runs DiabetTech, a site dedicated to testing glucose monitors and insulin pumps. He's been writing about the promise of AI-assisted carb counting for months. The use case is perfect: people with Type 1 diabetes dose insulin based on carb estimates. Get it wrong by twenty grams and you're either crashing or spiking. An AI that could nail carb counts from photos would be genuinely life-changing.
So he tested it. Twenty-seven thousand times. And what he found is what every startup CTO privately knows but won't say on pitch calls: large language models are probabilistic by design. They sample from a distribution. They don't compute. They don't measure. They guess, eloquently, within a range.
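The mechanism is worth seeing up close. At every step, a decoder picks the next token by sampling from a probability distribution, and temperature controls how sharp that distribution is. Here's a minimal toy sketch in Python; the logits and token strings are invented for illustration, not pulled from any real model:

```python
import math
import random

def sample_next_token(logits: list[float], temperature: float = 1.0) -> int:
    """Draw one token index from a softmax over the logits.

    With temperature > 0 this is a random draw: the same input can
    yield a different token on every call. Only in the limit of
    temperature -> 0 does it collapse to a deterministic argmax.
    """
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical logits for three candidate answers, in grams of carbs.
tokens = ["45", "52", "60"]
logits = [2.1, 1.9, 1.4]

# Same input, ten queries: the answer drifts, by design.
print([tokens[sample_next_token(logits, temperature=0.8)] for _ in range(10)])
```

Run it twice and you'll get two different lists. That's not a flaw in the code; it's the contract.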
For creative tasks—summarizing articles, drafting emails, brainstorming product names—that's fine. For anything where you need the same input to produce the same output, it's a category error.
The really damning part? Street wasn't even using a no-name model. He tested industry leaders. The stuff investors are pouring billions into. The tech that's supposed to automate radiology and legal discovery and financial auditing.
Here's the thing: Street's experiment isn't an outlier. It's a reproduction of what OpenAI, Anthropic, and Google all document in their own papers. Temperature settings, stochastic sampling, nondeterministic batched inference on GPUs: these aren't bugs. They're features. The models are built to be creative, not consistent.
Which raises the uncomfortable question nobody wants to ask during a Series B: what percentage of current AI deployments are solving problems that actually require determinism?
I've watched three fintech startups pitch AI-powered fraud detection in the last month. I've seen healthcare companies demo diagnostic assistants. I've sat through compliance automation decks. Every single one leans hard on accuracy metrics—F1 scores, precision-recall curves, benchmark performance.
None of them mentioned reproducibility.
None of them addressed what happens when the same transaction, run through the same model six times, gets flagged once and cleared five times.
The gap between lab benchmarks and production reliability is where the current wave of AI companies will die. Benchmarks measure aggregate performance across thousands of examples. Production systems need per-instance consistency. Those are not the same thing.
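The distinction fits in a dozen lines. In this toy example (the answers and ground truth are invented), a model posts a respectable benchmark number while flunking the consistency test a production system actually lives or dies by:

```python
# Hypothetical: two items, each run through the same model five times.
# runs[i][j] is the model's answer for item i on run j.
runs = [["52", "48", "52", "61", "52"], ["30", "30", "30", "30", "30"]]
truth = ["52", "30"]

# Aggregate accuracy: fraction of all answers that are correct.
# This is roughly what a benchmark leaderboard reports.
flat = [(ans, t) for row, t in zip(runs, truth) for ans in row]
accuracy = sum(ans == t for ans, t in flat) / len(flat)

# Per-instance consistency: fraction of items where every run agrees.
# This is what a production SLA actually depends on.
consistency = sum(len(set(row)) == 1 for row in runs) / len(runs)

print(f"aggregate accuracy:     {accuracy:.0%}")     # 80%
print(f"per-instance agreement: {consistency:.0%}")  # 50%
```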
Street's carb-counting test is brutal because the stakes are legible. Dose wrong, go to the hospital. But the same dynamic plays out in less visible domains. Contract review tools that flag different clauses on successive reads. Code review assistants that approve a pull request one day and reject it the next. Customer support bots that escalate identical tickets inconsistently.
You can paper over some of this with engineering. Set temperature to zero. Use greedy decoding. Cache outputs. It helps, but it doesn't close the gap: even at temperature zero, batched GPU inference and floating-point quirks can shift outputs, and the model vendors document as much. You're fighting the architecture. You're trying to make a storytelling engine do arithmetic.
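For concreteness, here's what turning every determinism knob looks like against an OpenAI-style chat completions API. The model name and prompt are placeholders, and the comments carry the caveat:

```python
from openai import OpenAI

client = OpenAI()

def estimate_carbs(image_url: str) -> str:
    """Ask a vision model for a carb count with every determinism knob set.

    temperature=0 makes decoding greedy and seed requests best-effort
    reproducibility, but neither is a guarantee: batched inference,
    floating-point non-associativity, and silent model updates behind
    the alias can still shift the output. This mitigates variance;
    it does not eliminate it.
    """
    response = client.chat.completions.create(
        model="gpt-4o",   # placeholder: any vision-capable chat model
        temperature=0,    # greedy decoding
        seed=42,          # best-effort reproducibility only
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How many grams of carbohydrate are on this plate? "
                         "Reply with a single integer."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```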
The smarter move—and I'm seeing this in maybe five percent of pitches—is to be honest about what LLMs are good for and build hybrid systems. Use the model to generate candidates. Use deterministic code to validate and rank them. Don't ask the AI to be the source of truth; ask it to be a very good guesser that feeds into a classical pipeline.
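Here's a sketch of that division of labor, under the same caveats as above: the function names, the 300-gram sanity cap, and the median vote are illustrative choices, not Street's method or any shipping product.

```python
import re
import statistics
from typing import Callable

def parse_carb_estimate(raw: str) -> int | None:
    """Deterministic validation: accept only a bare integer in a sane range."""
    match = re.fullmatch(r"\s*(\d{1,3})\s*", raw)
    if match is None:
        return None
    grams = int(match.group(1))
    return grams if grams <= 300 else None  # reject absurd plates

def carb_estimate(query_model: Callable[[str], str],
                  image_url: str,
                  n_samples: int = 5) -> int:
    """Hybrid pipeline: the model proposes, deterministic code disposes.

    Sample several candidates, discard anything that fails validation,
    and return the median so one eloquent outlier can't set an insulin
    dose. query_model is any image-in, text-out function, e.g. the
    estimate_carbs call sketched above.
    """
    candidates = []
    for _ in range(n_samples):
        grams = parse_carb_estimate(query_model(image_url))
        if grams is not None:
            candidates.append(grams)
    if not candidates:
        raise ValueError("no valid estimates; escalate to a human")
    return round(statistics.median(candidates))
```

The model never holds the pen on the final number. It just fills the candidate pool.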
Earth AI, the critical minerals startup that just made news for vertically integrating its exploration stack, gets this. They're using models to identify promising geological formations, then validating with actual drilling and assays. The AI accelerates search. It doesn't replace measurement.
That's the pattern I want to see more of: AI as a filter, not an oracle.
Because here's what happens when you deploy a non-deterministic system in a deterministic use case: you get variance. And in regulated industries, in medical contexts, in anything involving money or safety, unexplained variance is called liability.
I talked to a legal tech founder last week who's pivoting away from automated contract generation. They spent eight months trying to get consistent output. Couldn't do it. Now they're rebuilding as a clause suggestion tool with human review. They'll move slower. They'll also ship something that works.
The carb-counting story is going to haunt a lot of pitch decks. Street published his methodology. Other people are going to run the same tests on their domains. Customer support tickets. Invoice processing. Medical triage. Every use case where people assumed accuracy on a test set means reliability in production.
Some of those tests are going to be fine. Translation, summarization, content generation—those tolerate variance. But any application where determinism matters is about to have a reckoning.
The tell is going to be how companies respond. The ones that say 'temperature zero solves this' or 'our fine-tuning fixes it' are lying to themselves. The ones that say 'you're right, we're rebuilding the system to use AI as a component, not the whole stack' might survive.
I've been writing about AI washing for two years. This is the next phase. It's not about companies claiming to use AI when they don't. It's about companies using AI for tasks it's fundamentally not designed for, then discovering the mismatch in production.
Street's experiment is a gift to every CTO who's been told to 'just add AI.' It's a twenty-seven-thousand-iteration proof that probabilistic models can't do deterministic work. You can quote that on your next sprint planning call.
The question now is whether the market figures this out before or after the next funding round. My money's on after. But the startups that pivot early—that rebuild around hybrid architectures, that sell AI augmentation instead of AI replacement—are going to be the ones still here in two years.
Because you can raise on a demo. But you can't retain customers on a model that gives a different answer every time.