GPT-5.5: Is This SPUD?

Apr 28

Written by Kenny

GPT-5.5, Legal Reasoning, and the Self-Critique Question

OpenAI’s GPT-5.5 release came with a claim that should matter to lawyers more than almost any feature demo: stronger internal chain-of-thought reasoning, including the ability to refine its reasoning process, try different strategies, and recognize mistakes before answering. OpenAI describes GPT-5.5 as a reasoning model trained to “think before” it answers, producing a long internal chain of thought and learning through training to refine that process. That is not the same as giving the user a fully transparent transcript of the model’s reasoning. But it does describe a more capable internal reasoning process, and that is where the legal significance begins.

Most public coverage of GPT-5.5 has focused on coding, computer use, agentic workflows, and benchmark leadership. That is understandable. Those are the communities stress-testing frontier models most aggressively. But for lawyers, the most important question is narrower and more consequential.

Can the model catch a bad legal step before it poisons the rest of the analysis?

Why this matters in law

Legal reasoning is not a single inference. It is a chain, perhaps as much as any form of reasoning can be.

Duty, breach, causation, damages.

Jurisdiction, statutory trigger, exception, carve-out.

Controlling authority, distinguishable authority, persuasive authority, dicta.

Each step conditions the next. A wrong conclusion early in the chain does not stay isolated. It changes what the model looks for, how it reads the next authority, how it frames the facts, and how confidently it states the final answer.

That is why legal AI failure is different from many other knowledge-work failures. A bad restaurant recommendation is inconvenient. A bad spreadsheet formula may be expensive. But a hallucinated case, a missed statutory exception, or a false statement about controlling authority can create client harm, sanctions exposure, malpractice risk, and professional-discipline concerns.

A model that simply reasons forward more fluently is useful.

A model that can reason forward, detect that an earlier step has become unstable, and revise the conclusion before producing the final answer would be something else entirely.

That is the promise. The evidence so far is encouraging, but it is not yet complete.

What the public evidence supports

Artificial Analysis currently ranks GPT-5.5 at the top of its Intelligence Index. It also reports that GPT-5.5 made its largest gains on AA-Omniscience, a benchmark designed to reward factual knowledge while penalizing hallucination. On that benchmark, GPT-5.5 reached the highest reported accuracy score, 57%, and gained 14 points over GPT-5.4.

OpenAI’s own system card also reports factuality improvement, but the details matter. GPT-5.5’s individual claims were 23% more likely to be factually correct than GPT-5.4’s, while full responses contained factual errors only 3% less often. OpenAI explains the gap by noting that GPT-5.5 tends to make more factual claims per response. That is a crucial distinction for legal users. More knowledgeable answers can still create more surfaces for error.
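A small calculation shows how a sizable claim-level improvement can shrink to almost nothing at the response level. The sketch below assumes, purely for illustration, that errors in individual claims are independent of one another. The per-claim error rates and claim counts are invented to show the shape of the effect and are not taken from OpenAI's system card.

```python
def response_error_probability(per_claim_error: float, claims_per_response: int) -> float:
    """Chance that a response contains at least one wrong claim.

    Assumes (hypothetically) that claim errors are independent.
    """
    return 1.0 - (1.0 - per_claim_error) ** claims_per_response


# Invented numbers: the newer model is more accurate per claim
# but makes more claims in each response.
older = response_error_probability(per_claim_error=0.040, claims_per_response=10)
newer = response_error_probability(per_claim_error=0.031, claims_per_response=13)

print(f"older model: {older:.1%} of responses contain an error")
print(f"newer model: {newer:.1%} of responses contain an error")
```

With these made-up inputs, the two response-level figures come out nearly identical even though the per-claim error rate fell by more than a fifth. A longer, more detailed legal memo gives the model more chances to be wrong somewhere.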

The broader benchmark picture is strong. OpenAI reports GPT-5.5 at 84.9% on GDPval, 84.4% on BrowseComp, 35.4% on FrontierMath Tier 4, and higher scores than GPT-5.4 across several professional, tool-use, and academic evaluations. Those are not legal benchmarks, but they are relevant because legal work depends heavily on long-context comprehension, multi-step reasoning, information synthesis, and tool use.

The legal-domain evidence is also encouraging. Harvey’s April 23, 2026, research preview reported that GPT-5.5 scored 91.7% on BigLaw Bench, up from GPT-5.4’s 91.0%. Harvey also reported perfect scores on 43% of tasks, scores above 0.80 on 87% of tasks, and no task scores below 0.50. Its evaluators highlighted improvements in legal reasoning, organization, and audience calibration.

Clio’s April 24, 2026 evaluation is more product-specific, but still useful. Clio reported that GPT-5.5 achieved the strongest performance it had recorded inside Clio’s AI system, with an 87.2% overall score. It also reported roughly a 20% relative improvement on legal-research tasks requiring controlling authority and roughly a 7% relative improvement on difficult document-analysis scenarios. Clio’s framing is important because it evaluates the model inside a grounded legal system, not as a free-floating chatbot.

Taken together, the public evidence supports a strong claim:

GPT-5.5 appears to be the most capable publicly documented model currently available for complex legal reasoning workflows.

That is not the same as saying lawyers can trust it unsupervised. It is not the same as saying hallucination is solved. And it is not the same as saying the self-critique mechanism has been validated on adversarial legal tasks.

What complicates the claim

The first complication is hallucination.

Artificial Analysis reports that GPT-5.5 leads on AA-Omniscience accuracy, but also reports an 86% hallucination rate on the same benchmark. That is worse than Claude Opus 4.7 at 36% and Gemini 3.1 Pro Preview at 50%. Artificial Analysis explains that GPT-5.5 appears more likely to answer when it does not know the answer. In legal work, that is exactly the behavior that requires disciplined verification.
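The highest accuracy and a very high hallucination rate can coexist if the hallucination rate is measured only over the questions the model failed to answer correctly, that is, the share of those misses where it gave a wrong answer instead of declining. That definition is an assumption here, not a quotation from Artificial Analysis, and the sketch ignores partially correct answers. The breakdown it prints is derived arithmetic, not reported data.

```python
from typing import Dict


def breakdown(accuracy: float, hallucination_rate: float) -> Dict[str, float]:
    """Split all questions into correct / wrong / declined shares.

    Assumes hallucination_rate = wrong / (wrong + declined), i.e. it is
    computed only over questions that were not answered correctly.
    """
    not_correct = 1.0 - accuracy
    wrong = not_correct * hallucination_rate
    declined = not_correct - wrong
    return {"correct": accuracy, "wrong": wrong, "declined": declined}


# The two headline figures quoted above for GPT-5.5.
gpt = breakdown(accuracy=0.57, hallucination_rate=0.86)
for label, share in gpt.items():
    print(f"{label:>8}: {share:.1%}")
```

Under that assumed definition, the two headline numbers would imply that the model answers wrongly on roughly 37% of all questions and declines on only about 6%. The lawyer's concern is less the miss itself than the confident answer where "I don't know" was the right response.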

So the accurate takeaway is not “GPT-5.5 hallucinates less.”

The accurate takeaway is more subtle:

GPT-5.5 appears to know more and reason better, but it still needs external grounding, source checking, and lawyer review for authority-dependent work.

The second complication is monitorability. OpenAI’s system card documents generally high chain-of-thought monitorability, but also reports narrow regressions in health-query evaluations. OpenAI attributes one regression to weaker monitor performance and another to lower agent faithfulness, and describes both as genuine monitorability regressions. Those are not legal tasks, but they matter because they show that chain-of-thought monitoring is not uniformly improved across every domain and task type.

The third complication is that chain-of-thought behavior is task-sensitive. A March 2026 paper, “Reasoning Models Struggle to Control their Chains of Thought,” found that reasoning models have much lower chain-of-thought controllability than output controllability, and that controllability changes with model size, training method, test-time compute, and problem difficulty. The authors are cautiously optimistic about current monitorability, but the broader point is still important for lawyers: general reasoning behavior does not automatically transfer cleanly to every specialized domain.

The fourth complication is doctrinal recognition.

Self-critique can only catch an error the model recognizes as an error. If the model does not know that a statutory exception matters, does not understand that a case is dicta rather than holding, or fails to identify the governing procedural posture, there may be nothing for the self-critique loop to catch. It can revise reasoning. It cannot magically supply missing doctrine unless the relevant law is retrieved, understood, and incorporated.

That is why legal AI still needs grounding.

A stronger model helps.

A better retrieval layer helps.

A lawyer still has to verify the authority.

What no public evaluation has shown yet

The missing test is not another general benchmark.

The missing test is adversarial and legal-specific.

A useful legal evaluation would do something like this:

Give the model a plausible but misleading premise.

Let it reason forward for several steps.

Introduce contrary controlling authority later in the task.

Then measure whether the model actually revises its earlier conclusion, or merely rationalizes around the contradiction.

That is the legal version of the self-critique question. It is not enough to ask whether GPT-5.5 performs better on legal tasks generally. The sharper question is whether it can detect that an earlier legal conclusion has become untenable and then rebuild the analysis from the corrected premise.
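The four steps above could be sketched as a small test harness. Everything here is hypothetical: the `Model` interface, the `RevisionCase` fields, the example facts, and the two stand-in models are invented for illustration, and the phrase matching is a crude placeholder for grading by a lawyer or a proper rubric.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: receives the conversation so far, returns a reply.
Model = Callable[[List[str]], str]


@dataclass
class RevisionCase:
    misleading_premise: str    # plausible but wrong starting point
    follow_ups: List[str]      # prompts that let the model reason forward
    contrary_authority: str    # controlling authority introduced late
    initial_marker: str        # phrase signalling the original conclusion
    corrected_marker: str      # phrase signalling the corrected conclusion


def run_case(model: Model, case: RevisionCase) -> str:
    """Classify one run as 'revised', 'rationalized', or 'unclear'."""
    turns = [case.misleading_premise]
    turns.append(model(turns))
    for prompt in case.follow_ups:
        turns.append(prompt)
        turns.append(model(turns))
    turns.append(case.contrary_authority)
    final = model(turns).lower()

    # Crude phrase matching stands in for grading by a lawyer or rubric.
    kept = case.initial_marker.lower() in final
    fixed = case.corrected_marker.lower() in final
    if fixed and not kept:
        return "revised"
    if kept and not fixed:
        return "rationalized"
    return "unclear"


# Invented example facts; no real authority is cited here.
case = RevisionCase(
    misleading_premise="Assume the limitations period began at signing. Is the claim barred?",
    follow_ups=["Which defenses follow from that?", "Draft the conclusion."],
    contrary_authority="Controlling authority: the period runs from discovery, not signing.",
    initial_marker="claim is barred",
    corrected_marker="claim is timely",
)


def stubborn_model(turns: List[str]) -> str:
    # Stand-in that never updates its conclusion.
    return "The claim is barred."


def revising_model(turns: List[str]) -> str:
    # Stand-in that rebuilds its answer once the authority appears.
    if any("Controlling authority" in t for t in turns):
        return "On the corrected premise, the claim is timely."
    return "The claim is barred."


print(run_case(stubborn_model, case))
print(run_case(revising_model, case))
```

The key design choice is that the contrary authority arrives only after the model has committed to intermediate steps, so the harness scores revision of an existing chain rather than performance on a clean prompt.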

I have not found a public legal evaluation that isolates that behavior.

Harvey’s and Clio’s results are valuable, but they do not appear to test this specific adversarial mid-chain revision problem. They show better legal-task performance. They do not yet prove the deeper self-correction claim in the legal domain.

What lawyers should actually do with GPT-5.5

GPT-5.5 should not be treated as a replacement for legal judgment.

It should be treated as a stronger reasoning engine inside a controlled workflow.

That distinction matters.

For legal brainstorming, GPT-5.5 may be the best model currently available. It can map issues, test theories, compare arguments, identify missing facts, and surface alternative legal framings with more persistence than prior models.

For document-heavy work, it appears meaningfully stronger. Clio’s reported gains on controlling-authority identification and document-analysis completeness line up with what lawyers actually need from AI: not just a plausible answer, but the ability to hold more context, connect more provisions, and notice more qualifications.

For authority-dependent legal work, however, GPT-5.5 still needs verification. Every cited case must be checked. Every quotation must be confirmed. Every procedural rule must be validated against the governing jurisdiction and date. Every conclusion that depends on controlling authority should be tied to the actual retrieved source.
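One way to make those four checks hard to skip is to track them explicitly for each authority-dependent claim in a draft. The sketch below is a hypothetical illustration of that idea; the class, field names, and placeholder claims are invented and do not describe any existing product's workflow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuthorityClaim:
    text: str
    citation_checked: bool = False      # cited case exists and supports the point
    quotation_confirmed: bool = False   # quoted language matches the source
    rule_validated: bool = False        # rule is current for jurisdiction and date
    source_attached: bool = False       # conclusion tied to the retrieved source

    def open_items(self) -> List[str]:
        """Names of the checks that have not been completed yet."""
        checks = {
            "citation": self.citation_checked,
            "quotation": self.quotation_confirmed,
            "rule": self.rule_validated,
            "source": self.source_attached,
        }
        return [name for name, done in checks.items() if not done]


def ready_for_signoff(claims: List[AuthorityClaim]) -> bool:
    """True only when every claim has passed every check."""
    return all(not claim.open_items() for claim in claims)


# Placeholder claims; the text is illustrative only.
draft = [
    AuthorityClaim("Placeholder holding", True, True, True, True),
    AuthorityClaim("Placeholder procedural rule", citation_checked=True),
]
for claim in draft:
    print(claim.text, "->", claim.open_items() or "verified")
print("ready:", ready_for_signoff(draft))
```

The point of the structure is that the model's output never marks its own checks as done. A person sets each flag after looking at the source.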

The model may be better at reasoning.

That does not make it self-verifying.

The real conclusion

GPT-5.5 is probably the most capable public model to date for complex legal reasoning work. The public benchmark record points in the same direction: stronger general reasoning, stronger knowledge-work performance, improved factuality at the claim level, strong legal-domain preview results from Harvey, and strong grounded-system results from Clio.

But the caution is just as important as the promise.

The self-critique capability that matters most to lawyers has not yet been proven in the way lawyers should care about. The question is not whether GPT-5.5 can generate a better answer. The question is whether it can abandon a legally wrong path after it has already started down it.

That test still needs to be run.

Until then, the best use of GPT-5.5 is not blind trust.

It is disciplined collaboration:

Use it for harder reasoning.

Ground it in real documents and retrieved authority.

Force it to expose assumptions.

Test the weak links.

Verify every authority-dependent claim.

The benchmark story is promising.

The legal verification burden remains.
