As large language models grow increasingly capable, one weakness continues to linger in plain sight: hallucinations. Even the most advanced systems still produce confident-sounding errors, and while the industry has developed various ways to detect them, a definitive solution remains elusive.
Probably, a newly funded startup, believes it has a path toward closing that gap. The company has raised $9 million in seed funding from Andreessen Horowitz, and its mission is straightforward but ambitious: stop factual errors and hallucinations from ever reaching the end user.
Founder Peter Elias describes the goal as pushing AI systems toward “five nines” reliability, meaning 99.999% accuracy, or roughly one error in every 100,000 outputs. That standard is routine in deterministic software systems but notoriously difficult to reach with probabilistic models. Achieving that level of precision, he argues, requires rethinking core assumptions in how AI applications are engineered.
The company’s first product is a data science tool designed to generate rapid insights from complex datasets. Each output is paired with citations and a full audit trail, reflecting a growing expectation for transparency in AI-assisted analysis.
But ensuring correctness required more than just better prompting. Elias and his team built what he calls a “data science mech suit”—an external system that tightly wraps the model’s outputs in verification logic. Every LLM-generated answer is checked against a deterministic validator, and anything that fails to align with the underlying dataset is rejected and reworked. The model itself is also trained to operate within this constraint, creating a feedback loop aimed at consistency and precision.
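The reject-and-rework loop described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Probably's actual implementation: the dataset, the `fake_model` stand-in for an LLM call, and the function names are all hypothetical, and a real validator would cover far more than a single total.

```python
from typing import Callable, Optional

# Hypothetical dataset: monthly revenue figures (illustrative values only).
DATASET = {"jan": 120, "feb": 95, "mar": 143}


def validate_total(claimed: int, data: dict) -> bool:
    """Deterministic check: recompute the total directly from the data."""
    return claimed == sum(data.values())


def answer_with_verification(
    generate: Callable[[int], int],
    validate: Callable[[int], bool],
    max_attempts: int = 3,
) -> Optional[int]:
    """Ask the model for an answer and release it only if the validator agrees.

    Returns None when no candidate passes, so an unverified answer
    never reaches the user.
    """
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if validate(candidate):
            return candidate
        # Rejected: loop again and let the model rework its answer.
    return None


def fake_model(attempt: int) -> int:
    """Stand-in for an LLM call: wrong on the first try, right afterwards."""
    return 400 if attempt == 0 else 358


result = answer_with_verification(
    fake_model, lambda claimed: validate_total(claimed, DATASET)
)
print(result)  # 358, accepted on the second attempt
```

The design choice worth noting is that the validator, not the model, decides what gets released: a wrong answer costs another attempt rather than an error shown to the user.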
“What we learned building this was that the better your harness engineering is, the weaker the model can be,” Elias says. “If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it’s an exercise in reducing ambiguity.”
This approach allows Probably’s system to run on relatively small models, which Elias describes as “four classes weaker than frontier models.” In practice, that makes it possible to run on local hardware, such as a desktop machine, rather than on large-scale cloud infrastructure, significantly cutting token costs.
At a moment when AI usage costs are under scrutiny and enterprises are reassessing budgets, that efficiency is a compelling proposition. And Elias sees the architecture as more than a niche tool for data science. In his view, the same system could extend into any domain where precision is critical—accounting, healthcare, and beyond.
“I think it’s really interesting that the big AI labs have not even attempted to do this,” Elias says. “They’re incentivized not to, because they make money the more times you have to correct the model.”



