Estimation is one of the oldest unsolved problems in software engineering. Teams consistently underestimate, scope creep is real, and the person who estimates is rarely the person who builds. We set out to build an AI tool that could produce reliable first-pass estimates from a project description, and the results surprised us.
The Problem We Were Actually Solving
Before building anything, we mapped the real pain points in our estimation workflow. The issue was not that developers were bad at estimating individual tasks; they were actually quite good. The problem was the accumulation of overlooked tasks, the coordination overhead that was always underestimated, and the time it took just to produce a structured breakdown from a vague brief.
Our target was not to replace engineering judgment. It was to automate the scaffolding: given a product brief, produce a structured task breakdown with categories, dependencies, and rough hour ranges — fast enough that a sales conversation could use it on the spot.
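To make that target concrete, here is the kind of structure we mean: a task list with categories, dependencies, and hour ranges. The field names below are illustrative, not our production schema.

```python
# Illustrative shape of a generated breakdown. All names and numbers
# here are made up for the example; the real schema may differ.
breakdown = {
    "tasks": [
        {
            "id": "t1",
            "name": "User authentication",
            "category": "backend",
            "depends_on": [],       # no prerequisites
            "hours_p50": 16,        # median historical hours
            "hours_p90": 28,        # pessimistic bound
        },
        {
            "id": "t2",
            "name": "Login UI",
            "category": "frontend",
            "depends_on": ["t1"],   # blocked on the auth task
            "hours_p50": 8,
            "hours_p90": 14,
        },
    ],
    "assumptions": ["Single identity provider", "No SSO in v1"],
}
```

The p50/p90 pair matters more than a single number: it communicates uncertainty to the client instead of hiding it.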
Architecture
- Input: A natural language project description (from a form or pasted from a brief).
- Enrichment: An LLM pass to extract and classify requirements — features, integrations, non-functional requirements, and unknowns.
- Breakdown generation: A second LLM call using few-shot examples from our historical project database to generate a structured task list in JSON.
- Estimation: Each task is matched by embedding similarity to historical tasks, and the p50/p90 hour ranges from those matches are applied.
- Refinement: The estimate is shown to an engineer who can adjust tasks, add missing items, and flag assumptions. Every correction is fed back as a training signal.
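The estimation step is the most mechanical part of the pipeline, so it is worth sketching. The snippet below is a simplified stand-in, assuming each historical task carries a precomputed embedding vector and recorded actual hours; in production the vectors would come from an embedding model, and the field names are hypothetical.

```python
# Sketch of the estimation step: rank historical tasks by cosine
# similarity to the new task's embedding, then take the p50/p90 of
# the actual hours from the closest matches.
import math
from statistics import quantiles

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def estimate(task_vec, history, k=3):
    """Return (p50, p90) hours from the k most similar historical tasks.

    Each history entry is assumed to look like
    {"vec": [...], "actual_hours": float}.
    """
    ranked = sorted(history, key=lambda h: cosine(task_vec, h["vec"]),
                    reverse=True)
    hours = [h["actual_hours"] for h in ranked[:k]]
    # quantiles(n=10) yields 9 cut points; index 4 is the median,
    # index 8 is the 90th percentile.
    qs = quantiles(hours, n=10, method="inclusive")
    return qs[4], qs[8]
```

Because the matches are concrete past tasks, an engineer can always ask "which projects is this estimate based on?", which is exactly the interpretability property discussed below.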
The Historical Data Problem
The biggest challenge was bootstrapping the historical task database. We had time-tracking data but it was messy — tasks named inconsistently, estimates mixed with actuals, and widely varying granularity.
We spent three weeks cleaning and standardising two years of project data. This was tedious but non-negotiable: garbage in, garbage out applies nowhere more forcefully than in estimation systems.
We ended up with about 4,200 cleaned task records across 80 projects, spanning web apps, mobile apps, API integrations, and AI features. Enough to provide meaningful similarity matches for the most common task categories.
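The clean-up itself was mostly mundane normalisation. A minimal sketch, assuming a simple in-memory row shape with illustrative field names and alias tables (the real pipeline was considerably longer):

```python
# Hypothetical sketch of the record clean-up: normalise task names,
# map inconsistent category labels to a canonical set, and drop rows
# that only have estimates rather than recorded actuals.
import re

CATEGORY_ALIASES = {
    "be": "backend", "api": "backend",
    "fe": "frontend", "ui": "frontend",
    "qa": "testing", "test": "testing",
}

def clean(raw_rows):
    """Return standardised task records from messy time-tracking rows."""
    cleaned = []
    for row in raw_rows:
        if not row.get("actual_hours"):  # estimate-only rows are useless here
            continue
        # Collapse whitespace and lowercase the task name.
        name = re.sub(r"\s+", " ", row["task"].strip().lower())
        raw_cat = row.get("category", "other").lower()
        cat = CATEGORY_ALIASES.get(raw_cat, raw_cat)
        cleaned.append({
            "task": name,
            "category": cat,
            "actual_hours": float(row["actual_hours"]),
        })
    return cleaned
```

Most of the three weeks went into building the alias tables and deciding, case by case, which granularity a record belonged to; the code above is the easy 10%.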
Results After Six Months
- Time to first estimate draft: from ~3 hours to ~25 minutes (an ~86% reduction).
- Estimation accuracy (within 20% of final): improved from 58% to 74% of projects at the point of sale.
- Overlooked task categories: the AI consistently flags testing, DevOps setup, and third-party integration scoping that humans frequently omit in initial estimates.
- Engineer acceptance rate: 82% of AI-generated breakdowns were used with only minor modifications.
What We Learned
The most valuable output of the tool is not the hour ranges — it is the structured task breakdown that forces clients and engineers to agree on scope before work begins. Even when estimates are wrong, the breakdown surfaces implicit assumptions early.
The embedding-similarity approach to matching historical tasks is simple and interpretable. Engineers trust it more than a black-box model because they can see which past projects a task was matched to.
Do not skip the human-in-the-loop step. The tool became trusted only because engineers could see exactly what it produced and correct it easily. A fully automated, end-to-end system would have been adopted far more slowly.