The Armored Car Problem:

Why Proprietary Legal AI Is Solving the Wrong Problem

and Why the Market Will Eventually Notice

 

Buckle up, folks. This one is going to make a lot of people unhappy.

Three years ago, when the first hallucinated citations made national news and attorneys began facing sanctions for filing AI-generated briefs they never verified, the legal technology industry identified its market opportunity. The fear was real, the headlines were damaging, and attorneys needed reassurance. Proprietary legal AI vendors stepped into that gap with a straightforward value proposition: our models hallucinate less than the general-purpose alternatives. Use us, and you are safer.

That value proposition was credible at the time. It may have even been correct. But it rested on a set of assumptions about the technology landscape that are no longer true, and it optimized for a fear that is quietly fading as the legal profession becomes more fluent with these tools. The question this article poses is not whether proprietary legal AI was ever useful. It is whether the structural advantages those products claim still justify their cost, their constraints, and their lock-in, given what has changed in the past eighteen months.

The short answer is that they do not. The longer answer requires walking through what actually changed, why the safety argument is thinner than it appears, and what is already replacing it in practice.

The information advantage is eroding

The foundational premise of most proprietary legal AI products is that they have access to legal information that general-purpose models do not. They are connected to case law databases, statutory repositories, and regulatory archives through retrieval-augmented generation, commonly known as RAG. When you ask a question, the system searches a curated legal database, retrieves relevant documents, and feeds them to the underlying language model as context for generating an answer.
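
For readers who want to see the mechanics, here is a minimal, self-contained sketch of that retrieve-then-generate pattern. Everything in it is illustrative: the toy corpus, the keyword-overlap retriever, and the prompt format are stand-ins for a real pipeline, not any vendor's actual implementation; a production system would use embedding search and a live model call in place of the final print.

```python
from dataclasses import dataclass

@dataclass
class Document:
    name: str
    text: str

# Toy stand-in for a curated legal database.
CORPUS = [
    Document("Toy Case A", "Negligence requires duty, breach, causation, and damages."),
    Document("Toy Statute B", "Personal injury actions must be filed within two years."),
]

def retrieve(question: str, top_k: int = 2) -> list[Document]:
    """Rank documents by naive keyword overlap; real systems use embedding search."""
    terms = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(terms & set(d.text.lower().split())))
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    """Pack the retrieved sources into the context handed to the language model."""
    sources = "\n".join(f"[{d.name}] {d.text}" for d in retrieve(question))
    return ("Answer using ONLY the sources below, citing them by name.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}")

# The assembled prompt is what actually gets sent to whichever model the platform wraps.
print(build_prompt("What are the elements of negligence?"))
```

Notice what the pattern is and is not: it is a search layer bolted in front of a general-purpose model. The value lives in the database being searched and the discipline of the prompt, not in any proprietary intelligence.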

This was a meaningful differentiator when the frontier models operated exclusively from their training data, with no internet access and no ability to search external sources in real time. A model that could only recall what it had seen during training was fundamentally limited. It could not know about a case decided last month. It could not confirm whether a statute had been amended. It could not verify its own citations against a live database.

That limitation no longer exists in the way it once did. The major frontier models now have internet access. Google’s Gemini models are integrated with live search. Perplexity AI was built from the ground up as a search-augmented research engine that grounds every answer in cited sources retrieved in real time. Claude, ChatGPT, and others can browse the web, fetch specific pages, and work with documents you provide. The retrieval capability that once required a specialized legal platform is now a standard feature of the tools attorneys already have access to.

The underlying legal information these vendors retrieve is overwhelmingly public record. Case law is published on CourtListener, Google Scholar, and through individual court systems. Federal and state statutes are available on government websites. Regulatory materials are published in the Federal Register and state administrative codes. The proprietary legal AI vendors did not create this information or secure exclusive access to it. They built a retrieval interface on top of public data.

This needs to be stated with precision, though, because the claim has limits. CourtListener does not have everything Westlaw has. Free databases have real coverage gaps: unpublished opinions, many state trial court decisions, and deep historical materials are thinly covered or missing entirely. And citator services like KeyCite and Shepard’s, which tell you whether a case has been overruled, distinguished, or questioned, have no free equivalent that matches their reliability. A frontier model browsing Google Scholar cannot replicate that function with the same confidence.

Likewise, RAG over a curated legal database with structured metadata, headnotes, and editorial enhancements is qualitatively different from a frontier model browsing the open web. The curated database has known coverage boundaries. Web browsing retrieves whatever the search engine surfaces, with all the unevenness that implies.

But here is what matters for the market question: for the majority of routine legal research tasks, the gap has closed. A practitioner who instructs a frontier model to search CourtListener for a specific case, or to retrieve the current text of a statute from a state legislature’s website, is performing the same retrieval operation the legal AI performs, without the intermediary and without the markup. A practitioner using Perplexity is getting search-grounded, citation-backed answers drawn from the same public legal sources, with the added benefit of being able to see exactly where each piece of information came from. The proprietary advantage in information access has not disappeared entirely. But it has narrowed to the point where it no longer justifies the pricing premium for most practitioners doing most legal work. And that narrowing is accelerating, not reversing.

The hallucination score: what it measures and what it does not

With the information access advantage diminished, the remaining differentiator for proprietary legal AI is the hallucination rate. Vendors point to benchmarking studies, particularly the Stanford HAI research, to demonstrate that their products produce fewer errors than general-purpose models. The numbers are real. Stanford’s evaluation found that Lexis+ AI produced hallucinated responses approximately 17% of the time and that Westlaw’s AI-Assisted Research hallucinated 33% of the time (Stanford HAI, 2024). General-purpose models tested on the same legal queries performed worse, with error rates reported between 58% and 82%.

But the Stanford study contains another finding that the marketing materials do not feature. Lexis+ AI answered only 65% of queries accurately. Westlaw’s AI-Assisted Research was accurate only 42% of the time. The gap between those numbers tells you something important: a large portion of responses were neither hallucinated nor accurate. They were incomplete, partially correct, or unresponsive. The hallucination rate is not the inverse of the accuracy rate. A product can avoid making things up and still fail to give you what you need. This distinction matters, because the vendors sell the hallucination score as a proxy for reliability, and it is not.

Even taking the hallucination numbers at face value, they deserve scrutiny, not dismissal. A 17% error rate is meaningfully better than a 75% error rate. In a vacuum, a single query to a legal AI is more likely to return a correct answer than a single query to a general-purpose model operating without guardrails. That is a real advantage, and intellectual honesty requires acknowledging it.

But the advantage raises two questions that the marketing materials do not address.

The first question is what was traded to achieve the score. The primary mechanisms vendors use to reduce hallucination rates are well understood in the research literature. They include lowering the temperature setting, which reduces the stochasticity of the model’s token selection; applying aggressive sampling constraints that narrow the range of possible outputs; and fine-tuning the model through supervised learning and reinforcement learning from human feedback to favor safe, hedged responses. Zhang et al., in their comprehensive survey of hallucination in large language models, identify these approaches as addressing primarily what they call fact-conflicting hallucination, where the model generates content that misaligns with established world knowledge (Zhang et al., “Siren’s Song in the AI Ocean,” Computational Linguistics, 2025, pp. 1373–1418).
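
To make the first two mechanisms concrete, here is a minimal sketch of temperature scaling and nucleus (top-p) sampling, the generic decoding-time constraints the literature describes. This is textbook mechanics, not any vendor's implementation.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0, top_p: float = 1.0) -> int:
    """Sample one token id from raw logits under temperature and nucleus (top-p) constraints."""
    # Temperature scaling: values below 1.0 sharpen the distribution toward
    # the single most likely token; values above 1.0 flatten it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus sampling: keep only the smallest set of tokens whose cumulative
    # probability reaches top_p, and discard the long tail entirely.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]

    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
```

Push temperature toward zero and the sampler collapses onto the modal token every time. That is exactly the behavior that suppresses invented citations, and by the same mechanism it suppresses the low-probability completions where unexpected connections live.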

These constraints work. They reduce the frequency of outright fabrication. But they carry trade-offs that are invisible to the buyer. A model with a low temperature setting is less likely to invent a case citation. It is also less likely to make the inferential leaps that constitute the most valuable form of AI-assisted legal work: issue spotting in unfamiliar areas, identifying non-obvious connections between doctrines, synthesizing disparate authorities into a coherent strategic framework. The constraints that improve the safety score simultaneously degrade the synthesis capability that makes AI a paradigm shift rather than a faster search engine.

Think of it as adding armor to a car. You can bolt steel plates onto every panel. The crash test rating will improve. The car is, by one measure, safer. But the added weight destroys its handling, its fuel efficiency, and its utility as a vehicle for getting somewhere. What you really want is not a heavier car. What you want is a better-built car with a driver who knows the road.

And here is the part that should concern the vendors most. The hallucination problem is not improving at the model level. It is getting worse. OpenAI disclosed in its system card for the o3 model, released in April 2025, that o3 hallucinated 33% of the time on the PersonQA benchmark, more than double the rate of the earlier o1 model’s 16% (OpenAI o3 System Card, April 2025; TechCrunch, April 18, 2025). If the most well-resourced AI company in the world is producing models that are more capable and less reliable at the same time, the notion that hallucination is a problem that will be engineered away at the model level is not holding up.

This trend creates a compounding problem for proprietary legal AI that does not exist for practitioners using frontier models directly. When the underlying models hallucinate more, the vendors must apply even more aggressive constraints to maintain their Stanford scores. More constraint means more degradation of the synthesis and reasoning capabilities that make the tools valuable. The better the vendors get at suppressing hallucination, the more they hollow out the intelligence of the product. Meanwhile, a practitioner using unconstrained frontier models absorbs the hallucination risk through methodology rather than model restriction, and retains the full reasoning power. Increasing hallucination rates at the model level hurt proprietary legal AI disproportionately, because the vendors are in a race to keep a score that requires them to sacrifice the thing their customers actually need.

The second question is whether the score measures the right thing. The Stanford benchmark, and most hallucination benchmarks, test whether the model produces factually incorrect outputs in response to legal queries under controlled conditions. That is a useful measurement. It is not a sufficient one.

In practice, the failures that burn attorneys are not the kind of failures benchmarks catch. The most dangerous errors are contextual, not factual. They include jurisdiction blending, where the model applies a legal standard from the wrong state or court system; assumption smuggling, where the model inserts facts that sound like record evidence but were never provided; posture mismatch, where the model writes as if the procedural posture were different from the actual one; and context drift, where a long-form output gradually shifts away from the constraints established at the beginning. Zhang et al. classify these as input-conflicting and context-conflicting hallucination, which are distinct from fact-conflicting errors and are not effectively addressed by the temperature and sampling constraints that drive benchmark scores (Zhang et al., 2025).

A model that scores 83% on a hallucination benchmark can still apply the federal Ashcroft v. Iqbal plausibility standard when the question calls for Illinois fact pleading under Marshall v. Burger King Corp. It can still draft a premises liability complaint that alleges dirty, discolored water on the floor when the incident report says “water on floor, source unknown.” It can still write a response to a motion to dismiss that references evidence outside the pleadings, inadvertently inviting conversion to summary judgment under Rule 12(d). These are the errors that result in sanctions, lost motions, and malpractice exposure. They are not hallucination in the benchmark sense. They are contextual errors that survive any benchmark because they look correct on the surface.

The structural case for cross-model validation

If the safety advantage of proprietary legal AI is thinner than advertised, the question becomes: what actually makes AI output safer? The answer is not a better model. It is a better verification method.

Research presented at NeurIPS 2025 by MIT researchers Kimia Hamidieh, Marzyeh Ghassemi, and colleagues, and highlighted by MIT News this week, provides the academic foundation for what practitioners have been discovering empirically. The study, “Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification,” addresses a fundamental limitation of the approach most AI systems use to evaluate their own reliability (Hamidieh et al., NeurIPS 2025 Reliable ML Workshop).

The standard method for assessing whether a language model’s output is trustworthy involves submitting the same prompt multiple times and checking whether the model generates consistent answers. This is called self-consistency, and it is intuitively appealing. If the model gives the same answer five times in a row, it seems more likely to be correct.

The MIT researchers demonstrate that this intuition is unreliable. Self-consistency measures what they call aleatoric uncertainty, which is the model’s internal confidence in its own prediction. But a model can be internally confident and still be wrong. It can return the same incorrect answer every time because the error is embedded in its weights, its training, or its fine-tuning. The researchers call this epistemic uncertainty, and they show that it requires a fundamentally different measurement: cross-model disagreement.

Their method involves comparing the target model’s response to responses from a group of similar language models trained by different organizations. Where the models agree, confidence is justified. Where they disagree, the disagreement signals that at least one model may be wrong, and the specific area of disagreement identifies where the practitioner should investigate further. The researchers combined this cross-model measure with self-consistency to create what they call a total uncertainty metric, and they found it consistently outperformed either measure alone across ten realistic tasks including question-answering, summarization, and reasoning (Hamidieh et al., 2025).
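
The structure of the metric is easy to see in code. The sketch below is a deliberately simplified toy, using exact answer matching and an arbitrary blend weight rather than the paper's actual estimator, but it captures the two distinct signals being combined.

```python
from collections import Counter

def agreement(answers: list[str]) -> float:
    """Fraction of answers that match the most common (modal) answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def total_uncertainty(same_model_samples: list[str],
                      cross_model_answers: list[str],
                      w: float = 0.5) -> float:
    """Blend self-consistency (one model, sampled repeatedly) with cross-model
    disagreement (several models, queried once each). Higher = investigate."""
    self_u = 1.0 - agreement(same_model_samples)    # internal confidence only
    cross_u = 1.0 - agreement(cross_model_answers)  # catches confidently-wrong models
    return w * self_u + (1 - w) * cross_u

# A model can be perfectly self-consistent (self_u = 0) while every independent
# model disagrees with it (cross_u high): the case self-consistency cannot detect.
print(total_uncertainty(["two years"] * 5, ["two years", "one year", "four years"]))
```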

The research tested general NLP tasks, not legal applications specifically. But the principle is arguably more applicable to legal work, not less, precisely because legal reasoning involves the kind of contextual complexity, jurisdictional nuance, and interpretive ambiguity where epistemic uncertainty is highest and single-model confidence is least trustworthy.

This finding has a direct and uncomfortable implication for proprietary legal AI. By design, a proprietary platform locks you into a single model. You cannot compare its output against a competing model within the same system. You are relying entirely on the model’s self-assessed confidence, which is precisely the measure the MIT research identifies as insufficient.

In practical terms, cross-model validation is already available to any attorney with access to multiple frontier models. Run the same legal question through Claude, ChatGPT, and Gemini. Where they agree, you have corroboration. Where they disagree, you have identified exactly the point that requires primary source verification. Perplexity’s recently launched Model Council feature begins to operationalize this concept by querying multiple frontier models simultaneously and presenting the areas of agreement, disagreement, and a consensus synthesis, all in a single interface. The cost of this multi-model approach is a fraction of what a single proprietary legal AI seat costs.
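
In code, the workflow is two API calls and a comparison. The sketch below assumes the official anthropic and openai Python SDKs with API keys already set in the environment; the model identifiers are illustrative and change over time, and Gemini or any third model slots in the same way.

```python
import anthropic
import openai

QUESTION = "What is the pleading standard for a negligence complaint in Illinois?"

claude_answer = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5",          # illustrative model id
    max_tokens=1024,
    messages=[{"role": "user", "content": QUESTION}],
).content[0].text

gpt_answer = openai.OpenAI().chat.completions.create(
    model="gpt-4o",                     # illustrative model id
    messages=[{"role": "user", "content": QUESTION}],
).choices[0].message.content

# The points of disagreement, not either answer alone, mark exactly
# what to verify against primary sources before relying on the output.
print("CLAUDE:\n", claude_answer, "\n\nGPT:\n", gpt_answer, sep="")
```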

The irony is pointed. The tool that charges the most provides the least structural safety. The tools that cost the least provide the most, because they allow the one thing a locked proprietary system cannot: independent verification through model disagreement.

External verification protocols: grading the test from the outside

Cross-model validation is one path to verification. The other is an external evaluation protocol that assesses the quality of AI output after generation, based on the evidence actually used in the synthesis rather than the model’s self-reported confidence.

The core insight is simple but underappreciated. When you ask a language model how confident it is in its own answer, you are asking the student to grade his own test. The model’s confidence score is a formatting choice, not a reliability signal. It reflects the patterns the model has learned about what confident answers look like, not a rigorous evaluation of whether the underlying evidence supports the conclusion.

An external verification protocol changes this dynamic. Instead of asking the model whether it thinks it is right, the protocol evaluates what the model relied on and whether those sources are appropriate for the type of question being asked. This approach begins with a recognition that not all legal questions are the same kind of question, and different kinds of questions require different kinds of evidence.

A question about the elements of a statutory cause of action is a question where binding authority is everything. It does not matter how many practitioners agree on the elements if the statute or the controlling appellate decision says something different. Authority is the only evidence that counts.

A question about how a particular judge runs her courtroom is a question where practitioner consensus is everything. There is no binding authority on judicial temperament. What matters is what lawyers who have appeared before her have experienced. A single verified practitioner account may be worth more than a treatise.

A naive verification system treats all sources on the same hierarchy. A sophisticated one adjusts its evidentiary weighting based on the type of question. It scores the quality and relevance of the sources the model actually used. It flags when the model generated citations the user did not provide. It applies hard caps to the confidence score when jurisdiction or procedural posture is unclear. It identifies what verification steps remain before the output can be relied upon. And critically, it does all of this from the outside, not from within the model’s own reasoning process.
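
To make the shape of such a protocol concrete, here is a toy scoring function. The question categories, evidence weights, penalty factor, and confidence cap are all invented for illustration; this is not the author's Juris Metric Protocol or any production rubric, only the skeleton the paragraphs above describe.

```python
from dataclasses import dataclass

# Invented categories and weights, purely illustrative.
EVIDENCE_WEIGHTS = {
    "elements_of_claim": {"binding_authority": 1.0, "secondary": 0.3, "practitioner": 0.1},
    "judge_practices":   {"binding_authority": 0.1, "secondary": 0.2, "practitioner": 1.0},
}

@dataclass
class Evaluation:
    question_type: str
    sources: list[tuple[str, float]]   # (kind, relevance 0-1) pairs the model relied on
    unverified_citations: int = 0      # citations the model introduced on its own
    jurisdiction_confirmed: bool = True

def confidence_score(ev: Evaluation) -> float:
    weights = EVIDENCE_WEIGHTS[ev.question_type]
    # Weight each source by how much its KIND matters for THIS question type.
    raw = sum(weights.get(kind, 0.0) * rel for kind, rel in ev.sources)
    score = min(1.0, raw / max(len(ev.sources), 1))
    # Penalize every citation the user never supplied and has not verified.
    score *= 0.5 ** ev.unverified_citations
    # Hard cap: unclear jurisdiction or posture limits confidence regardless of sources.
    if not ev.jurisdiction_confirmed:
        score = min(score, 0.4)
    return round(score, 2)

# Same evidence, different question type, different score: a strong practitioner
# account is nearly worthless on an elements question and decisive on a
# judge-practices question.
srcs = [("practitioner", 0.9)]
print(confidence_score(Evaluation("elements_of_claim", srcs)))  # 0.09
print(confidence_score(Evaluation("judge_practices", srcs)))    # 0.9
```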

These protocols can be built by anyone. They do not require proprietary technology. They require an understanding of legal epistemology, the discipline to implement structured evaluation, and the willingness to treat AI output as a draft that earns trust through verification rather than a finished product that arrives with trust pre-installed.

The proprietary legal AI vendors could build this kind of verification. Most have not. Their business model depends on the premise that the model itself is trustworthy. An external verification layer that highlights the model’s residual uncertainty would undermine the marketing message that their product is the safe choice.

The fear market and its expiration date

The value proposition of proprietary legal AI was built on fear. Three or four years ago, when generative AI first entered public consciousness and the sanctions began, the primary barrier to attorney adoption was fear of hallucination. Vendors identified this through market research and built their products accordingly. Lower hallucination rates. Citation verification. Stanford scores on the landing page.

That fear is a function of unfamiliarity, and unfamiliarity is declining every month. Thomson Reuters reports that active generative AI use among legal organizations jumped from 14% to 26% in a single year, with 78% of law firms believing AI will become central to their workflows within five years (Thomson Reuters, 2025 Legal Industry Report). The ACC/Everlaw GenAI Survey found that corporate legal adoption more than doubled in the same period, from 23% to 52%. Attorneys are not growing more frightened of these tools. They are growing more fluent with them.

As attorneys gain experience with these tools, their primary concerns are migrating. The initial fear was “will it make things up and will I get sanctioned.” The emerging questions are different: “how powerful is this for issue spotting and creative synthesis,” “how much of my workflow can this absorb,” and “am I getting the most capable tool available or am I paying a premium for one that has been deliberately constrained.”

When capability and utility overtake hallucination avoidance as the primary purchasing criteria, the legal AI vendors will be holding the wrong product. They built the armored car. The market is starting to ask why it handles so poorly.

The economics

A note on data security before we get to the numbers, because it will come up. The ethical obligation is to reasonably protect client confidences. Every paid tier of the major frontier models allows the practitioner to disable data training entirely. Your prompts do not surface individually in other users’ outputs regardless, but the toggle exists and it resolves the question. This is not a genuine differentiator for proprietary legal AI. It is a marketing talking point that dissolves the moment you look at the actual settings on the paid plans you are already using.

The economic case compounds the structural one. Westlaw Precision with CoCounsel costs approximately $428 per month for a single attorney (industry pricing surveys, 2026). Harvey AI is priced for enterprise deployment at rates that place it well above that figure.

The alternative: Perplexity Pro at $20 per month gives a practitioner access to frontier models with real-time search grounding and citation verification. Perplexity Max at $200 per month provides access to the most advanced models from OpenAI, Anthropic, and others, along with the recently launched Model Council feature for simultaneous cross-model validation. Even at the premium tier, the cost is less than half of a single CoCounsel seat, and the practitioner gets multi-model access, cross-validation capability, and the full unconstrained reasoning power of the frontier models.

For the attorney who prefers to manage their own tool stack, three individual frontier model subscriptions (Claude Pro, ChatGPT Plus, and Gemini Advanced) total approximately $60 to $75 per month. That gives the practitioner three independent models for cross-validation, each with internet access and document analysis capability, at roughly one-sixth the cost of a single proprietary legal AI seat.

The cost differential is not marginal. Depending on the configuration, it runs from roughly six-fold to more than twenty-fold. And the less expensive option provides more structural safety (cross-model validation), more reasoning capability (unconstrained synthesis), and more flexibility (model-agnostic workflows that are not locked to a single vendor’s architecture).

The remaining counter-argument is convenience. The legal AI makes it easy. You do not have to build your own workflow. You do not have to learn how to prompt effectively. You do not have to develop verification habits. That is true, and for some attorneys, the convenience premium is worth paying today. This article does not pretend otherwise. The verification protocols and cross-model workflows described here require expertise that most attorneys do not yet have. Building system prompts, running multi-model validation, and applying structured evaluation to AI output is not trivial.

But the convenience moat is not static. It is eroding. The expertise is learnable. The economic pressure to learn it is intensifying, particularly for solo practitioners, small firms, and contingency practices where every dollar of overhead is felt. Resources for developing these skills are increasingly available, and the attorneys who invest in methodology rather than subscription are building a capability that improves as the models improve, without vendor dependency. Convenience is a real advantage. It is not a permanent one.

What replaces it

The argument of this article is not that attorneys should abandon structured tools and rely on raw, unprompted AI output. That would be reckless. The argument is that the value in legal AI has never been in the model. It has always been in the workflow that wraps the model. Verification protocols. Structured prompting systems. Cross-model validation. Risk-tiered supervision. These are the components that make AI safe for legal practice, and none of them require a proprietary platform.

An attorney who develops a structured verification protocol that evaluates output based on the type of question, the quality of supporting evidence, and the presence of jurisdictional or factual gaps has a tool that works on any model. It is not dependent on one vendor’s fine-tuning. It is not degraded when the vendor changes its underlying architecture. It improves as the models improve, because better models produce better raw material for the same verification process to evaluate.

An attorney who runs the same question through multiple frontier models and investigates the points of disagreement is performing a more rigorous reliability check than any single model, no matter how specialized, can perform on its own work. The MIT research confirms this is not just intuition. It is a measurably superior approach to uncertainty quantification.

And an attorney who understands the failure modes, who knows that contextual errors survive benchmarks, who treats every output as a draft subject to supervision, who separates the learning phase from the drafting phase, who builds verification into the workflow rather than hoping for it after the fact: that attorney does not need a $400 per month product to be safe. That attorney needs a method. The method is transferable. It is model-agnostic. And it scales without vendor dependency.

The trajectory

None of this means proprietary legal AI will disappear overnight. Large firms with established vendor relationships and institutional inertia will continue to pay for these products for some time. Thomson Reuters and LexisNexis have distribution networks, training programs, and integration ecosystems that create real switching costs.

But the structural argument is moving in one direction. The information access advantage has narrowed to citator services and deep archival coverage, and that remaining gap shrinks as free legal databases expand. The hallucination advantage is thinner than the marketing suggests, and it comes at a cost to the capabilities attorneys are starting to prioritize. Worse for the vendors, increasing hallucination rates at the model level mean they must constrain harder to hold their scores, which means the product gets weaker precisely as the market demands it get stronger. The academic research is aligning with what practitioners have been discovering: cross-model validation and external verification are structurally superior to single-model self-assessment. And the economics are increasingly difficult to justify, particularly for the solo practitioners, small firms, and contingency practices that represent the majority of the legal market by headcount.

The legal AI vendors built their products for a market defined by fear. That was a rational response to the conditions of 2023 and 2024. But markets evolve. The fear is fading. The capabilities of the frontier models are expanding. And the attorneys who are paying attention are starting to ask a question the vendors would prefer they did not: what exactly am I paying for that I cannot get for a fraction of the price with better results?

The profession does not need more expensive wrappers around public data. It needs better methodology for supervising AI output. The wrapper is the product the vendors sell. The methodology is what actually keeps attorneys safe. One of those is a subscription. The other is a skill. And skills, once learned, do not charge monthly.

•  •  •

About the Author

 

Kenny Ferrigno practiced law for thirteen years, including extensive work in criminal defense and personal injury litigation. He now consults on AI integration for law firms, specializing in verification protocols, structured prompting systems, and workflow design. He is the author of The Skeptic’s Guide to Legal AI: The Juris Metric Protocol (2026).

 

Sources

 

Hamidieh, K., Thost, V., Gerych, W., Yurochkin, M., & Ghassemi, M. (2025). Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification. Presented at NeurIPS 2025 Reliable ML Workshop, December 2025. MIT / MIT-IBM Watson AI Lab.

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., et al. (2025). Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics, 51(4), 1373–1418. MIT Press.

Dahl, M., Magesh, V., Suzgun, M., & Ho, D.E. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford HAI / Stanford RegLab. Published in Journal of Empirical Legal Studies, 2025.

OpenAI. (2025). o3 and o4-mini System Card. Released April 16, 2025.

Thomson Reuters. (2025). Legal Industry Report: AI adoption trends.

ACC/Everlaw. (2025). GenAI Survey: Corporate legal adoption metrics.

ABA Standing Committee on Ethics and Professional Responsibility. (2024). Formal Opinion 512: Generative AI Tools.
