AI Underwriting Models: The Accuracy Trap

July 4, 2022 Astrid Holm

Almost every AI underwriting company we meet presents a similar slide: model performance on a test set, expressed as AUC or Gini coefficient, compared against the carrier's existing actuarial model. The AI model is typically better. Sometimes substantially better. The implication is clear: the model is superior, and therefore the carrier should deploy it.
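For readers less familiar with the two metrics on that slide, here is a minimal sketch of how they relate. It assumes a binary claim/no-claim label and uses the standard identity Gini = 2 × AUC − 1. The scores, labels, and function names are invented for illustration and do not come from any carrier or company.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a randomly chosen claim is scored above a randomly
    chosen non-claim, with ties counted as half a win."""
    claims = [s for s, y in zip(scores, labels) if y == 1]
    non_claims = [s for s, y in zip(scores, labels) if y == 0]
    if not claims or not non_claims:
        raise ValueError("need at least one claim and one non-claim")
    wins = 0.0
    for c in claims:
        for n in non_claims:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(claims) * len(non_claims))


def gini(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Gini coefficient as a rescaling of AUC: 0 is random, 1 is perfect."""
    return 2.0 * auc(scores, labels) - 1.0


# Hypothetical toy data: 1 = policy had a claim, 0 = no claim.
labels = [0, 0, 1, 0, 1, 1]
incumbent_scores = [0.2, 0.6, 0.5, 0.3, 0.4, 0.9]   # stand-in for the actuarial model
challenger_scores = [0.1, 0.5, 0.4, 0.2, 0.8, 0.9]  # stand-in for the AI model

print("incumbent  AUC:", round(auc(incumbent_scores, labels), 3))
print("challenger AUC:", round(auc(challenger_scores, labels), 3))
print("challenger Gini:", round(gini(challenger_scores, labels), 3))
```

Both numbers measure only how well a model ranks risks on the data it was tested on. They say nothing about who else can reach the same ranking.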

This framing reflects a real capability but obscures the question that actually determines competitive sustainability. High test-set accuracy is achievable by many teams working from the same or similar data. It is a capability, not a moat. The question worth asking — and the one most founders in this space are not asked frequently enough — is: what happens to that accuracy advantage when a well-funded competitor trains on the same data?

The Data Sourcing Question

The first dimension of the accuracy trap is data sourcing. The majority of insurance ML models at seed stage are trained on one of three types of data: carrier-provided proprietary data under a pilot agreement, commercially licensed third-party data (credit bureau, vehicle data, geospatial indices), or publicly available proxies. The first type creates a dependency on a specific carrier relationship. The second and third types are available to any team with a data budget.

If your model is trained primarily on commercially available data, any competitor who licenses the same data and applies similar modeling techniques can approximate your performance. Your AUC advantage may be 4–5 points today because you built earlier and spent more time on feature engineering. In 18 months, a well-funded competitor building from the same data sources will have closed most of that gap. The accuracy advantage is temporary.

The models that maintain accuracy advantages do so through proprietary data collection — data signals that only you are collecting because of a specific relationship, a specific technology, or a specific access agreement. Telematics data collected through a hardware partnership. Claims development data from a consortium of carriers who share it only with your platform. Behavioral data from a distribution partner who only routes through you. The accuracy comes from the data, and the data is defensible because of a structural relationship, not because of engineering effort alone.

Overfitting to Historical Distributions

There is a second version of the accuracy trap that is more technically subtle. Insurance underwriting models trained on historical claims data face an inherent distributional shift problem: the data-generating process in the future is not identical to the data-generating process in the past.

This is always true in forecasting. But it is acutely true in insurance underwriting right now because several of the factors that drove historical loss distributions are in structural transition. Motor claims severity is rising faster than general inflation because of EVs and the cost of sensor-dense bodywork repair. The changing frequency of climate-driven events is shifting the loss distribution for property peril cover in ways that historical data from 2000–2020 does not fully capture. Medical inflation is distorting long-tail liability development patterns.

A model that achieves high accuracy by fitting tightly to a historical loss distribution — maximizing test-set AUC on historical holdout data — can underperform in production because the distribution has shifted. The accuracy on the historical test set was real. The accuracy in production is different. Carriers who discover this after deploying the model have a significant problem with reserve adequacy and pricing cycle discipline.
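One common way to get an early read on this gap is an out-of-time holdout: train on earlier underwriting years and test only on later ones, rather than shuffling all years together. The sketch below is illustrative only. The field names (`underwriting_year`, `had_claim`), the cutoff, and the records are assumptions made up for the example.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, int]


def out_of_time_split(policies: List[Policy], cutoff_year: int) -> Tuple[List[Policy], List[Policy]]:
    """Train on policies written before cutoff_year, hold out the rest."""
    train = [p for p in policies if p["underwriting_year"] < cutoff_year]
    holdout = [p for p in policies if p["underwriting_year"] >= cutoff_year]
    return train, holdout


def claim_frequency(policies: List[Policy]) -> float:
    """Share of policies with at least one claim."""
    if not policies:
        raise ValueError("no policies in this split")
    return sum(p["had_claim"] for p in policies) / len(policies)


# Hypothetical records as (underwriting_year, had_claim) pairs.
raw = [
    (2017, 0), (2017, 0), (2017, 1),
    (2018, 0), (2018, 0), (2018, 1),
    (2019, 0), (2019, 1),
    (2020, 1), (2020, 0),
    (2021, 1), (2021, 1),
]
policies = [{"underwriting_year": y, "had_claim": c} for y, c in raw]

train, holdout = out_of_time_split(policies, cutoff_year=2020)
print("train claim frequency:  ", claim_frequency(train))
print("holdout claim frequency:", claim_frequency(holdout))
```

In this toy data the claim frequency in the later years is double that of the training years. A random split would blend the two periods and hide exactly the shift that production will expose.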

The models we prefer are the ones with explicit distributional shift monitoring built in — teams that have instrumented their production deployments to detect when the model's outputs are drifting from observed outcomes, and who have a governance process for triggering retraining when that drift exceeds a defined threshold. This is not exotic ML operations. It is standard practice. But many seed-stage underwriting AI companies have not yet built it, because they have not yet had production deployments long enough for distributional shift to surface.
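As a rough illustration of what such monitoring can look like, the sketch below compares actual claims with the number the model expected over a monitoring window and flags retraining when the ratio strays beyond a tolerance. It is one simple approach among several. The 10% tolerance, the names, and the data are assumptions for the example, and it presumes that outcomes in the window are mature enough to compare.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DriftReport:
    actual_to_expected: float   # observed claims / model-expected claims
    retrain_recommended: bool   # True when drift exceeds the tolerance


def check_drift(
    predicted_claim_probs: Sequence[float],
    observed_claims: Sequence[int],
    tolerance: float = 0.10,  # hypothetical governance threshold
) -> DriftReport:
    """Flag retraining when actual claims deviate from expected by more than tolerance."""
    if len(predicted_claim_probs) != len(observed_claims):
        raise ValueError("predictions and outcomes must align one-to-one")
    expected = sum(predicted_claim_probs)
    if expected <= 0:
        raise ValueError("expected claim count must be positive")
    ratio = sum(observed_claims) / expected
    return DriftReport(ratio, abs(ratio - 1.0) > tolerance)


# Hypothetical window: the model expected 2 claims across 8 policies.
predicted = [0.25] * 8
stable_window = [1, 0, 0, 1, 0, 0, 0, 0]    # 2 claims observed
shifted_window = [1, 0, 1, 1, 0, 0, 0, 0]   # 3 claims observed

print(check_drift(predicted, stable_window))
print(check_drift(predicted, shifted_window))
```

A production version would typically segment the check by line of business or rating cell and allow for claims development lag. The governance point stands either way: the threshold and the response to breaching it are defined in advance.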

What Actually Creates Defensibility

If accuracy is not the moat, what is? Based on our portfolio experience, the sustainable competitive advantages in AI underwriting technology fall into three categories.

The first is proprietary data accumulation. A model that ingests data from a specific production deployment accumulates labeled outcomes — actual claims against the policies that were priced using the model — that allow continuous improvement. After 24 months of production deployment at a carrier, the model has been exposed to a distribution of outcomes that is specific to that carrier's portfolio and unavailable to a competitor. The production feedback loop creates compounding advantage.

The second is workflow integration. A model embedded deeply into the carrier's underwriting workflow carries a switching cost that a model delivered as an API score does not. Such a model generates not just a risk score but a structured recommendation that maps to the carrier's specific appetite and pricing bands, and it is integrated with the core policy admin system. The carrier's underwriters have adapted their process around the tool. Replacing it requires a workflow re-engineering project, not just a model substitution.

The third is regulatory positioning. EIOPA guidelines on AI use in underwriting, and the anticipated requirements under the EU AI Act for high-risk AI systems, create a compliance cost for model deployment that disadvantages new entrants. A model that has been deployed, stress-tested, and documented in accordance with current guidance has a compliance foundation that a new entrant needs to rebuild from scratch. This is not the most exciting moat, but it is real and it compounds over time.

Asking the Right Questions in Diligence

When we evaluate an AI underwriting company at seed stage, we do not spend much time on the test-set performance slide. We spend time on four questions: where does your training data come from, who else has access to it, what is your strategy for accumulating proprietary data through production deployments, and how does your model's accuracy behave under distributional shift?

These questions are sometimes uncomfortable for founders because they surface the places where the competitive case is weakest. But they are also the questions a well-run carrier will ask before signing a production deployment agreement. Founders who can answer them clearly, and who have an honest account of where their model is defensible and where it is not, are the ones who will navigate the carrier procurement process successfully.

The accuracy trap is not fatal to the category. AI underwriting models genuinely improve on actuarial methods in specific applications. The point is that accuracy is a starting requirement, not a destination. The destination is a system that gets more accurate over time because of a defensible data advantage — not one that achieves a benchmark on historical data and then competes on that number indefinitely.