The Tunable Metric
Legal AI Vendors Won't Tell You Which Model You're Paying For. Here's Why That Matters.
Why Legal AI Benchmarks Measure Caution, Not Competence
Kenny Ferrigno
Introduction
I am an advocate of artificial intelligence in legal practice. Not a reluctant one. I consult with law firms on AI integration, I wrote a book on the subject, and I believe AI makes lawyers better at their jobs in measurable, demonstrable ways. It surfaces relevant authority in minutes that would otherwise take hours. It forces attorneys to articulate their reasoning with a precision that sloppy thinking cannot survive. When a lawyer engages with AI as a research and reasoning partner, they sharpen skills that traditional workflows let atrophy. I believe AI belongs in every modern legal workflow.
That is precisely why what I see in the current market for dedicated legal AI platforms concerns me.
I am not talking about frontier models, the general-purpose large language models built by OpenAI, Anthropic, Google, and others. Those are the tools I have spent the last several years developing procedures and protocols around. They are increasingly capable legal reasoning tools in the hands of a competent attorney who understands their strengths and limitations. I am talking about the growing market of proprietary legal AI products, typically priced at $500 or more per seat per month, that promise attorneys a purpose-built, legally optimized AI experience. These platforms market themselves on two core claims: superior accuracy through reduced hallucination, and access to exclusive proprietary legal content that frontier models lack.
I have a thesis about both of these claims. It is a thesis, not a certainty, and I want to be transparent about what is established by evidence and what is my informed inference from that evidence. But I have tested both claims against the available research, and neither holds up as well as the marketing suggests.
The thesis is this. The hallucination benchmarks that legal AI vendors use to demonstrate their value are tunable metrics. They do not measure how well a model reasons about law. They measure how aggressively a vendor has constrained the model’s output, using inference parameters that any competent engineer can adjust without retraining the model, without improving its reasoning, and without the customer ever knowing the difference. The economic incentives point overwhelmingly toward running cheaper, less capable models behind a branded interface, and the vendors’ refusal to disclose which models they use or to submit to independent reasoning benchmarks is consistent with exactly that arrangement. Meanwhile, the proprietary legal content that forms the other half of the value proposition is less exclusive than the pricing implies, because a century of legal scholarship has already made its way into the public legal record that frontier models are trained on.
Those are strong claims. Let me show you the evidence.
Part One: The Tunable Metric
Hallucination is not accuracy
The legal AI industry has coalesced around a single metric as its primary proof of value: hallucination reduction. Platform after platform advertises its hallucination rate, or more precisely the inverse, as evidence that its product is reliable enough for legal work. On its face, this sounds reasonable. Hallucinations are a legitimate concern. A model that fabricates a case citation or invents a holding that does not exist poses an obvious risk to any attorney who relies on it uncritically.
But reducing hallucinations and producing accurate, well-reasoned legal analysis are not the same thing, and the legal AI industry has built its entire value proposition on the conflation of the two.
A hallucination-free response can still be wrong. A model can cite a real case, quote a real holding, and apply it to the wrong set of facts. It can miss the controlling authority entirely in favor of a superficially relevant but distinguishable opinion. It can produce an answer that is factually grounded yet analytically shallow, the legal equivalent of a correct citation in a losing argument. Accuracy in legal reasoning is a function of selection, application, synthesis, and judgment. No current hallucination benchmark measures any of that.
The empirical record is not kind to these platforms. A preregistered study published in the Journal of Empirical Legal Studies, led by researchers at Stanford’s RegLab and Human-Centered Artificial Intelligence center, evaluated the hallucination rates of leading legal AI platforms including Lexis+ AI, Westlaw’s AI-Assisted Research, and Ask Practical Law AI.¹ The study found that these tools hallucinated in response to 17% to 33% of benchmarking queries. These are products marketed as purpose-built for legal reliability, priced at a premium, and they still fabricated or misattributed legal authority in as many as one out of every three queries. If a junior associate produced work product with that error rate, they would be placed on a performance improvement plan.
When researchers measured actual accuracy, not just whether citations were real but whether the legal answers were correct, the results were worse. A February 2026 study presented at CSLAW ’26 benchmarked Westlaw AI and Lexis+ AI against the LaborBench dataset of U.S. state unemployment insurance requirements, a domain where both platforms specifically market their multi-jurisdictional survey capabilities.² Westlaw AI achieved 58% accuracy. Lexis+ AI achieved 64%. A standard retrieval-augmented generation baseline from earlier work on the same dataset had achieved 70%, and a purpose-built open-source statutory research tool called STARA reached 83%, rising to 92% when the researchers corrected for errors in the ground truth compiled by Department of Labor attorneys themselves. The commercial platforms were outperformed not only by a custom academic tool but by a generic baseline.
These are not cherry-picked results. These are the only independent, peer-reviewed accuracy measurements that exist for these platforms, because the vendors have declined to produce any of their own.
How benchmarks are tuned
Here is the part of this that I think most attorneys do not understand, and that the vendors have no incentive to explain.
When a legal AI platform reports a low hallucination rate, most attorneys assume this reflects a better model, one that has been trained more carefully on legal data and that understands law more deeply. That assumption is reasonable but wrong. The hallucination rate of any language model can be reduced without improving the model at all. It can be reduced by adjusting a small number of inference parameters that govern how the model selects its output, parameters that require no retraining, no additional legal data, and no engineering breakthrough. Just a configuration change.
The three parameters that matter most are temperature, top-p, and top-k. Each controls a different dimension of the same underlying mechanism: how broadly the model is allowed to explore the space of possible responses before committing to one.
Temperature governs how sharply probability is concentrated across the possible next tokens. At a high temperature, the distribution flattens and the model considers a wider range of candidates, including less obvious ones. At a low temperature, the distribution narrows, and as the setting approaches zero the model effectively defaults to the most statistically likely token at every step. Top-p, sometimes called nucleus sampling, sets a cumulative probability threshold. If top-p is set to 0.9, the model only considers the smallest set of most-likely tokens whose combined probability reaches 90%, cutting off the long tail of less likely options. Top-k is simpler: it limits the model to the k most probable tokens at each step, regardless of their cumulative probability.
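To make the mechanism concrete, here is a minimal sketch of how these three settings shrink the pool of candidate tokens. It is a toy illustration, not any vendor's pipeline: the function names, the example scores, and the choice to apply top-k before top-p are all assumptions made for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_tokens(logits, temperature, top_k, top_p):
    """Return the token indices still eligible for sampling after all three filters.

    Assumed order (hypothetical, for illustration): temperature, then top-k, then top-p.
    """
    probs = softmax_with_temperature(logits, temperature)
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Renormalize over the survivors before applying the nucleus threshold.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    return kept

# Made-up scores for six candidate tokens (purely illustrative).
example_logits = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]

loose = candidate_tokens(example_logits, temperature=1.0, top_k=6, top_p=0.95)
tight = candidate_tokens(example_logits, temperature=0.2, top_k=3, top_p=0.5)

print("permissive settings keep", len(loose), "candidates:", loose)
print("conservative settings keep", len(tight), "candidates:", tight)
```

Under the conservative settings the pool collapses to the single most likely token, which is the narrowing described above. Nothing about the underlying scores changed between the two runs; only the configuration did.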
In moderation, all three of these parameters improve output quality. A moderate top-p focuses the model’s reasoning without flattening it. A reasonable top-k reduces noise without eliminating signal. A carefully chosen temperature balances predictability and analytical range. These are legitimate engineering tools.
But when all three are set conservatively, when the temperature is dropped near zero, the top-p is tightened, and the top-k is restricted, the cumulative effect is a model operating at its narrowest possible range. The output becomes predictable, safe, and shallow. It reads like a textbook summary. It states the obvious rule and applies it in the most straightforward way available. It does not surprise you. It does not reframe the question. It does not make an argument you had not considered.
And, critically, it hallucinates less. Because hallucination and analytical creativity share a common mechanism. Both are products of variability in the model’s output. When the model explores the probability space and lands on something useful, something that synthesizes authority in a novel way or frames an argument with particular force, we call that quality legal analysis. When it explores that same space and lands on something wrong, a fabricated citation or a misattributed holding, we call it a hallucination. The difference is not in the mechanism. It is in the outcome. This relationship between output variability and both creative quality and hallucination risk is documented in the natural language generation literature.³
The implication is significant. A legal AI vendor can reduce its hallucination rate by tightening these three parameters, report the improved benchmark to its customers, and market the result as a better product. The model did not get smarter. The model did not learn more law. The model got more cautious. And there is no way for the customer to tell the difference, because the vendor does not disclose its inference settings.
The economics of the hidden engine
Every major legal AI platform on the market today is built on top of a foundation model developed by someone else. No legal technology vendor has built a large language model from the ground up. They begin with a frontier model created by OpenAI, Anthropic, Google, or one of a small number of other foundational AI companies. They add proprietary training data, fine-tune for legal tasks, layer retrieval-augmented generation on top, and wrap the whole thing in a branded interface.
This is not a criticism. It is how the industry works. But it creates an economic dynamic that attorneys should understand.
Not all frontier models are equal. The reasoning capability, depth of analysis, and reliability of output vary enormously across models and across tiers within the same provider’s product line. As of early 2026, Anthropic charges API customers $5.00 per million input tokens and $25.00 per million output tokens for its flagship Opus 4.6 model. Its budget-tier Haiku 4.5 costs $1.00 and $5.00 respectively, one-fifth the price.⁴ On the OpenAI side, the spread is wider. The o3-pro reasoning model costs $20.00 per million input tokens and $80.00 per million output tokens. The o4-mini costs $1.10 and $4.40, roughly one-eighteenth the cost.⁵
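The ratios are easy to check. The short sketch below uses only the per-million-token prices quoted above; the dictionary layout and the function name are illustrative choices, not anything published by the providers.

```python
# Prices per million tokens as (input, output), taken from the figures quoted in the text.
PRICES = {
    "Opus 4.6": (5.00, 25.00),
    "Haiku 4.5": (1.00, 5.00),
    "o3-pro": (20.00, 80.00),
    "o4-mini": (1.10, 4.40),
}

def price_ratio(flagship, budget):
    """Return how many times more the flagship costs than the budget tier, as (input, output)."""
    flagship_in, flagship_out = PRICES[flagship]
    budget_in, budget_out = PRICES[budget]
    return flagship_in / budget_in, flagship_out / budget_out

anthropic_ratio = price_ratio("Opus 4.6", "Haiku 4.5")
openai_ratio = price_ratio("o3-pro", "o4-mini")

print("Anthropic flagship vs budget:", anthropic_ratio)
print("OpenAI flagship vs budget:", tuple(round(r, 1) for r in openai_ratio))
```

The same multiple applies to input and output tokens within each provider, so the gap holds regardless of how a given workload splits between the two.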
For a legal AI vendor processing thousands of queries per day across hundreds or thousands of seats, the difference between routing to a flagship model and routing to a budget model is not a rounding error. It is the difference between a viable margin and an unsustainable one. A vendor charging $500 per month per seat has every financial incentive to route queries to the cheapest model that produces a passable answer, not the most capable model that produces the best one.
Consider the cost comparison from the practitioner’s perspective. A solo attorney paying $500 per month for a single seat on a legal AI platform spends $6,000 per year. A Claude Pro subscription costs $20 per month, $240 per year, and provides direct access to Anthropic’s flagship model. Even at full API pricing, running several hundred substantive research queries per month against the most capable model available would cost a small fraction of the platform price. The arithmetic is not close.
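A rough worked example shows the scale of the gap. The flagship prices are the ones quoted earlier; the workload figures (300 queries a month, 20,000 input tokens and 2,000 output tokens per query) are hypothetical assumptions chosen for illustration, and real usage will vary.

```python
def monthly_api_cost(queries, input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of a month of queries; prices are per million tokens."""
    total_input = queries * input_tokens
    total_output = queries * output_tokens
    return (total_input * input_price + total_output * output_price) / 1_000_000

# Hypothetical workload (assumptions, not measured data).
QUERIES_PER_MONTH = 300
INPUT_TOKENS_PER_QUERY = 20_000   # prompt plus pasted source material
OUTPUT_TOKENS_PER_QUERY = 2_000   # length of the model's answer

# Flagship prices per million tokens, as quoted in the text.
FLAGSHIP_INPUT_PRICE = 5.00
FLAGSHIP_OUTPUT_PRICE = 25.00

PLATFORM_SEAT_PRICE = 500.00  # per month, as quoted in the text

api_cost = monthly_api_cost(
    QUERIES_PER_MONTH,
    INPUT_TOKENS_PER_QUERY,
    OUTPUT_TOKENS_PER_QUERY,
    FLAGSHIP_INPUT_PRICE,
    FLAGSHIP_OUTPUT_PRICE,
)
share_of_seat_price = api_cost / PLATFORM_SEAT_PRICE

print(f"Estimated flagship API cost: ${api_cost:.2f} per month")
print(f"Share of the platform seat price: {share_of_seat_price:.0%}")
```

Changing the assumed token counts moves the total, but it would take a workload roughly ten times heavier than this one before flagship API cost approached the seat price.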
The platforms do not disclose which model sits behind their API calls. They do not tell you whether every query hits the same model or whether queries are routed dynamically based on complexity, cost, or server load. You are paying a premium and trusting, entirely on faith, that the vendor is delivering a premium product behind the curtain. No competitive or reputational pressure compels the vendor to deliver one, because the customer never sees the engine.
The transparency problem
This opacity is not standard practice across the AI industry. Perplexity, for example, discloses to the user which model is processing their query in real time. A user can see whether their question was handled by Claude, GPT, or another model, and can make informed judgments about the output accordingly. This is not a radical act of transparency. It is a basic one. No major legal AI vendor offers anything comparable.
The irony should not be lost on this profession. Lawyers are trained, and ethically obligated, to cite their sources. An attorney who submitted a brief supported by anonymous authority would face sanctions. A law firm that assigned client work to an unnamed associate and refused to disclose who performed the work would face serious questions about accountability and competence. Yet this is precisely the arrangement that legal AI vendors have normalized.
Consider the analogy directly. You are a senior partner at a firm. You send a research task to your associate pool. The answer comes back, but you are not told which associate did the work. You know that the pool contains associates of varying capability: some are highly analytical, some are adequate, some are still learning. The quality of the work product depends entirely on who performed it, and your professional obligation to the client depends on your ability to evaluate that quality. No competent partner would accept that arrangement.
The refusal to submit to independent reasoning benchmarks fits the same pattern. The most widely cited academic benchmark for legal reasoning, LegalBench, encompasses 162 tasks across six categories including issue-spotting, rule application, and interpretation.⁶ The commercial legal AI vendors have not submitted their platforms to it. LexisNexis withdrew from the Vals AI benchmark study, the most comprehensive independent evaluation of legal AI tools published to date, for all tasks except legal research.⁷ When the most prominent independent effort to evaluate these tools on real legal tasks gives a vendor the opportunity to participate, and the vendor walks away from every task but one, that is not a neutral data point.
Put it together. The only metric these vendors are willing to be measured on is the one they can tune for by adjusting inference parameters. The metrics that would measure actual reasoning quality are the ones they avoid. The underlying model is hidden. The parameters are hidden. The routing logic is hidden. And the customer is asked to pay $500 a month and trust the brand.
I am not saying I know with certainty what is behind the curtain. I am saying that every piece of available evidence points in the same direction, and the vendors have the ability to prove me wrong at any time by simply showing their work. They have not done so.
Part Two: The Corpus That Already Escaped
If Part One has done its job, the reader is now asking the obvious question: what about the proprietary content?
This is the strongest card in the legal AI vendor’s hand. Whatever you think about the model under the hood, whatever you think about the inference parameters, these platforms offer something frontier models do not: access to a curated, exclusive corpus of legal materials including treatises, practice guides, editorial commentary, headnotes, key number classifications, and annotated case summaries built over decades by legal publishers. The argument is that without this proprietary layer, your AI is working with an incomplete picture of the law. Pay the premium, and you get the full picture.
It is a reasonable argument. It is also, I believe, overstated in ways that matter for the purchasing decision. What follows is an argument I have not seen made elsewhere, and I want to be transparent about what it is: a thesis built on established scholarship and observable facts, but one that has not yet been empirically tested in the specific context of AI training data. I believe the logic is sound. I invite the challenge.
The law is already public
Start with primary law, because the distinction matters. Every binding legal authority in the United States, every published federal and state court opinion, every statute, every regulation, is a matter of public record. It is available at scale through open sources. The Caselaw Access Project digitized over 6.9 million decisions from every volume of the published National Reporter System, spanning the entire history of American case law, and as of 2024 that dataset is freely available without access restrictions.⁸ CourtListener, operated by the Free Law Project, provides free access to over nine million legal decisions from more than 2,000 courts, covering over 99% of all precedential U.S. case law.⁹ Google Scholar’s case law database covers all U.S. Supreme Court opinions, all published federal court decisions since 1923, and all state supreme and appellate court decisions since 1950.¹⁰ The law itself is not behind a paywall. It never was.
What these vendors actually offer on top of the public record is secondary source material. Treatises, practice commentaries, headnotes, editorial summaries, key number digests. These are interpretive, analytical, and organizational tools built on top of primary law. They are often excellent. Generations of attorneys have relied on them to orient their research and identify relevant authority efficiently. I do not dispute their practical value.
But they are not the law. A treatise cannot be cited as controlling authority in a brief. A headnote is not a holding. An editorial comment on a ruling is not binding on any court. These materials inform the attorney’s understanding. They do not constitute the legal authority on which arguments are built or decisions are rendered.
The paradigm that escaped the paywall
Here is the point I think the vendors have not reckoned with, and it is the one that matters most for the question of whether their proprietary content justifies the premium.
Robert C. Berring, the Walter Perry Johnson Professor of Law at UC Berkeley and arguably the most influential scholar of legal information systems in the past half century, spent much of his career making a specific argument about the West Key Number System.¹¹ His claim was not merely that West’s digest system helped lawyers find cases. His claim was that it provided, in his words, “a paradigm for thinking about the law itself.” The Key Number System did not just organize cases into topics. It organized the profession’s understanding of what the topics were, which doctrinal distinctions mattered, and how legal issues related to each other. Richard Danner, in a 2014 article examining Berring’s thesis, traced how the West digest system became so deeply embedded in American legal practice that it shaped not just how lawyers researched but how they conceptualized legal problems.¹²
This matters for the AI question because of what happened over the century that followed.
Lawyers used these editorial tools to find cases, frame arguments, and structure their understanding of doctrine. They then wrote briefs using the frameworks those tools provided. Judges read those briefs and, when the reasoning was persuasive, adopted those frameworks in their opinions. Law professors analyzed the resulting doctrine and published their analyses in law reviews. CLE presenters distilled it all for the next generation of practitioners. At every step, the intellectual substance of the proprietary editorial system, the analytical frameworks, the doctrinal groupings, the way issues were framed and related to one another, propagated outward into the public legal record.
That propagation was not accidental. It was not leakage. It was the ordinary operation of a profession that thinks by writing. A frontier model trained on judicial opinions, appellate briefs, law review articles, bar journal publications, and practitioner commentary, all of which are part of datasets like the Pile of Law and the legal subsets of Common Crawl,¹³ is trained on text that was already organized by the system the vendors claim as their exclusive advantage.
The model does not need the headnote. It has the profession that was built by the headnote.
What the law itself says about this
The law recognized a version of this distinction in Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence.¹⁴ The court addressed whether copying West’s headnotes to train a competing AI system constituted fair use. Judge Bibas held that it did not. But the basis of the holding is instructive for the argument I am making. What the court protected was the specific creative expression contained in the headnotes, the particular words West’s attorney-editors chose to summarize each point of law. What the court did not and could not protect was the underlying legal principles those headnotes described. The ideas, the doctrinal frameworks, the analytical structures are not copyrightable. They belong to the profession.
Ross got in trouble for copying the headnotes directly into its training data. Nothing in the holding suggests any problem with a model that is trained on millions of briefs and opinions whose reasoning was shaped by a century of headnote-driven research. The expression is proprietary. The ideas escaped long ago.
Where the argument is strongest and where it is not
I want to be honest about the limits of this thesis, because I think intellectual honesty is what separates a good argument from a sales pitch.
The argument is strongest in heavily litigated, well-established areas of law. In torts, contracts, civil procedure, constitutional law, criminal law, and other core doctrinal areas, there are decades of opinions, briefs, and scholarship reflecting the analytical frameworks that originated in proprietary editorial work. The signal is strong. The propagation has been thorough. A frontier model trained on this material has, for practical purposes, absorbed the substance of those frameworks even without access to the source material.
The argument is weaker for emerging doctrines, niche practice areas, and jurisdictions with thin case law. Where the primary record is sparse, a well-curated treatise or practice guide provides genuine incremental value precisely because the analytical framework has not yet had time to propagate through judicial opinions and practitioner writing. I am not dismissing that value. For a practitioner in a narrow specialty where the secondary source material is the most current and comprehensive analysis available, access to that material may be worth paying for.
A sophisticated critic might also argue that each step of propagation involves transformation, and that the analytical precision of the original headnote degrades as it passes through briefs, opinions, and commentary. That is a fair theoretical concern. But it must be weighed against an observable fact: frontier models already produce competent legal analysis in core practice areas without access to proprietary secondary sources. If the propagation had degraded the signal to the point of uselessness, that would not be possible.
I should also note what I am not able to establish empirically. No one has yet conducted a controlled comparison of legal AI output produced with access to proprietary secondary sources against output produced without that access, holding all other variables constant. That study would be valuable, and I would welcome it. If it showed that proprietary content produces meaningfully better outcomes on legal reasoning tasks, I would update my position accordingly. But the absence of that study cuts both ways. The vendors who claim that their proprietary content justifies a $500-per-month premium have not produced that evidence either.
What I can say is this. The pricing model for these platforms is built on the premise that without their proprietary content, your AI is working with a fundamentally incomplete picture of the law. The evidence and the logic suggest that the premise is overstated. The weight of the intellectual contribution made by a century of proprietary legal scholarship already lives in the public legal record. The pricing implies an exclusivity that the history of the profession has already eroded.
What Would Change My Mind
I want to be clear about what this article is. It is not a rejection of AI in law. It is not a rejection of dedicated legal AI platforms as a concept. If a vendor built a platform that ran a flagship reasoning model, disclosed its architecture, submitted to independent evaluation of reasoning quality, and justified its pricing against that demonstrated quality, I would recommend it enthusiastically. I want that product to exist.
What I cannot do, given everything outlined above, is point to any of the current platforms and say, without reservation, that the premium is justified by the product. The evidence available to me does not support that conclusion. The evidence the vendors could provide to change my mind has not been offered.
Here is what would change it.
Disclose the base model. Tell your customers whether their queries are being processed by a flagship reasoning model or a budget-tier alternative. If the model changes, say so. If queries are routed dynamically to different models based on cost or complexity, publish the routing logic. Attorneys evaluate the source of the information they rely on. Give them the ability to do that with your product.
Submit to independent evaluation of reasoning quality, not just hallucination rates. The legal profession does not measure competence by counting how often an attorney avoids being wrong. It measures competence by the quality of the analysis, the strength of the argument, the ability to synthesize authority into something useful. If your platform cannot withstand that standard of review, your benchmark numbers are decoration. And if you participate in independent evaluations, participate fully. Do not withdraw from every task but one and then market on the strength of the one you stayed in.
Publish your inference parameters, or at minimum acknowledge publicly that hallucination benchmarks can be influenced by parameter settings that do not reflect underlying model quality. The legal profession does not tolerate misleading metrics from expert witnesses. It should not tolerate them from the tools it pays to use.
Demonstrate the content premium empirically. Conduct or commission a controlled study comparing legal reasoning output produced with access to proprietary secondary sources against output produced without it. If the proprietary content produces meaningfully better outcomes, publish the results and justify the pricing on that basis. If the study has not been done, the premium is being charged on an untested assumption.
None of these are unreasonable demands. Every one reflects a standard that already exists within the legal profession itself. Cite your sources. Submit your work to scrutiny. Justify your fees. Be honest about limitations. Attorneys are held to these standards every day. The tools they pay for should be held to them too.
I will be the first to update this recommendation when a vendor meets that bar. I am not hard to convince. I just require evidence. That should sound familiar to anyone who practices law.
Notes and Sources
¹ Varun Magesh et al., “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools,” 22 Journal of Empirical Legal Studies 216 (2025). Originally published as a preprint by Stanford RegLab and HAI (arXiv:2405.20362, May 2024). The study tested Lexis+ AI, Westlaw AI-Assisted Research, Ask Practical Law AI, and GPT-4 using a preregistered benchmark of legal queries, finding hallucination rates between 17% and 33% for the commercial tools.
² Mohamed Afane et al., “Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys,” CSLAW ’26, March 3–5, 2026, Berkeley, California (arXiv:2603.03300). The study benchmarked Westlaw AI (58% accuracy), Lexis+ AI (64%), standard RAG (70%), and the Statutory Research Assistant, STARA (83%, corrected to 92%), against the LaborBench dataset of U.S. state unemployment insurance provisions.
³ Ziwei Ji et al., “Survey of Hallucination in Natural Language Generation,” ACM Computing Surveys 55, no. 12 (2023). The survey documents the relationship between output variability and both hallucination frequency and creative output quality across natural language generation systems.
⁴ Anthropic Claude API Pricing (March 2026): Opus 4.6 at $5.00/$25.00 per million input/output tokens; Haiku 4.5 at $1.00/$5.00 per million input/output tokens.
⁵ OpenAI API Pricing (March 2026): o3-pro at $20.00/$80.00 per million input/output tokens; o4-mini at $1.10/$4.40 per million input/output tokens.
⁶ Neel Guha et al., “LegalBench: A Collaboratively Built Benchmark Dataset for Legal Reasoning” (2023). The benchmark encompasses 162 tasks across six categories of legal reasoning. No major commercial legal AI vendor has submitted its platform to this evaluation.
⁷ Vals AI, “Application Reports: VLAIR 2-27-25” (February 2025). The study evaluated CoCounsel (Thomson Reuters), Vincent AI (vLex), Harvey Assistant (Harvey), and Oliver (Vecflow) across seven legal tasks. LexisNexis withdrew from all tasks except legal research.
⁸ Harvard Law School Library Innovation Lab, “Caselaw Access Project” (2024). The project digitized 6.9 million decisions from the National Reporter System. The full dataset was made available without access restrictions in 2024.
⁹ Free Law Project, “Meet CourtListener: Your New Case Law Power Tool,” HeinOnline Blog (December 2025).
¹⁰ Google Scholar case law database coverage per Google’s documentation (updated 2026).
¹¹ Robert C. Berring, Walter Perry Johnson Professor of Law, UC Berkeley School of Law. The American Association of Law Libraries named his Finding the Law the most significant contribution to law librarianship of the past fifty years (2006). Berring’s scholarship on the relationship between legal information systems and legal thought spans multiple works, including “Collapse of the Structure of the Legal Research Universe: The Imperative of Digital Information on the American Legal System,” 69 Washington Law Review (1994), and “Legal Research and the World of Thinkable Thoughts,” 2 Journal of Appellate Practice and Process 305 (2000). Berring argued that West’s American Digest System provided practicing lawyers not merely a means of locating cases but “a paradigm for thinking about the law itself.”
¹² Richard A. Danner, “Influences of the Digest Classification System: What Can We Know?,” 33 Legal Reference Services Quarterly (2014). Available at Duke Law Scholarship Repository. Danner examines Berring’s thesis about the influence of the West digest system on American legal thought and traces how the classification system became embedded in the profession’s conceptual framework during the century it dominated legal research.
¹³ Peter Henderson et al., “Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset” (2022), arXiv:2207.00220. The dataset includes court opinions, briefs, law review articles, and other legal documents drawn from the public record.
¹⁴ Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc., No. 20-613 (D. Del. Feb. 10, 2025) (Bibas, J.). The court held that Ross Intelligence’s copying of West headnotes to train a competing AI legal research system did not constitute fair use. The holding protects the specific creative expression of the headnotes but does not extend to the underlying legal principles they describe, which are not copyrightable.