The Design Gap Behind Legal’s AI Token Price Problem

Nimal Hemelge
June 12, 2026

The legal market has discovered a new AI problem: the bill. AI consumption costs (the charges that rack up every time an AI system reads, reasons, retrieves context and generates output) are becoming harder to predict and govern. Per-seat AI billing has given way to a new reality in which enterprises pay per prompt. As model quality improves and agentic workflows expand, consumption can spike even when per-token prices hold steady, and usage quickly compounds. Uber's COO recently admitted the company had burned through its $3.4 billion R&D AI budget for 2026 by April, while an unnamed enterprise reportedly accumulated $500 million in Claude AI charges in a single month. With consumption accelerating (over 80% of legal teams now have broad AI access), law firms and legal teams are starting to ask: how much will this cost us as usage grows?

The responses being debated include routing tasks to cheaper models, fine-tuning open-source alternatives and moving to alternative pricing models. Some firms are reportedly considering building their own models entirely. These are all sensible responses to an expensive input.

But they all address what tokens cost. Very few people are asking why the systems consume so many in the first place. If tokens are the oil of legal AI, why is the conversation focused solely on the price of oil and not on the fuel efficiency of the vehicle?

The problem is not just the price. It is the workflow. 

Here is an example from a real client conversation. A lawyer at a large organization described their workflow: whenever a contract comes in, they upload the document, paste in the full playbook, add context notes, and submit everything to their AI tool. The tool generates a comprehensive review. They work through the output and refine it.

They do this every single time. For every contract. Including the ones where the only change is a date, a company name, and a signature block. 

What they have built is a system that consumes roughly as many tokens on a trivial amendment as on a 50-page master services agreement requiring genuine legal judgment. There is no triage. There is no routing. There is no moment where someone, or something, asks: does this actually need to go through the AI at all?

That is not a token cost problem. That is a workflow design problem. Sending every piece of work to the most expensive resource available, regardless of whether that resource is the best fit, is a form of waste. It consumes budget that could fund better-quality work on the things that actually need it.

What good system design actually looks like 

A well-designed workflow starts with a question most tools skip entirely: what does this piece of work actually require? 

Some requests need no AI at all. A document where only the counterparty name and effective date have changed can be handled by a trained reviewer in minutes without touching a model. Others need a fast, focused pass against a standard playbook, a risk flag if something material has shifted, and a human check on the output. Only a small subset warrants deep reasoning, full context and extended token use: novel positions, non-standard terms, commercial tension that requires weighing competing priorities. These are not the majority of what comes through the door.
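As a rough illustration, the triage decision described above can be expressed in a few lines of Python. This is a minimal sketch under stated assumptions, not a product feature or a prescribed method: the Route tiers, the ADMINISTRATIVE_FIELDS set and the IncomingContract fields are hypothetical names, and the sketch assumes an upstream comparison against the approved template has already identified which fields changed and whether any terms are non-standard or novel.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Hypothetical handling tiers mirroring the three cases in the text."""
    HUMAN_ONLY = "trained reviewer, no model call"
    PLAYBOOK_PASS = "fast playbook check, risk flag, human check"
    DEEP_REVIEW = "full context and extended reasoning"


# Hypothetical: fields whose change alone does not justify a model call.
ADMINISTRATIVE_FIELDS = {"counterparty_name", "effective_date", "signature_block"}


@dataclass
class IncomingContract:
    # Fields that differ from the approved template (assumed computed upstream).
    changed_fields: set[str]
    # Assumed flags from an upstream comparison against the playbook.
    has_nonstandard_terms: bool
    has_novel_position: bool


def triage(doc: IncomingContract) -> Route:
    """Decide what the work requires before anything is sent to a model."""
    # Substantive signals win: they always go to deep review.
    if doc.has_novel_position or doc.has_nonstandard_terms:
        return Route.DEEP_REVIEW
    # Only administrative fields changed (or nothing changed): no model needed.
    if doc.changed_fields <= ADMINISTRATIVE_FIELDS:
        return Route.HUMAN_ONLY
    # Material but standard changes: a focused playbook pass plus human check.
    return Route.PLAYBOOK_PASS


amendment = IncomingContract(
    changed_fields={"counterparty_name", "effective_date"},
    has_nonstandard_terms=False,
    has_novel_position=False,
)
print(triage(amendment).name)  # HUMAN_ONLY
```

Checking for novel or non-standard terms first is a deliberate choice: a document whose edits look administrative still goes to deep review if its substance has shifted.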

Organizations struggling with AI spend are not necessarily struggling because individual tokens are expensive. They are struggling because their workflows consume tokens indiscriminately: every piece of work, regardless of what it actually needs, gets treated the same way.

Designing AI-enabled workflows for efficiency 

Systems designed for efficiency build this judgment into a triage step before anything is submitted, rather than defaulting everything to the highest-cost configuration. The architecture also needs active management as models evolve: token consumption on the same workflow can shift significantly when the underlying models change. This is not a one-time engineering decision. It is an ongoing operational discipline.

The structural name for this approach is disaggregation: breaking legal work into its component tasks and ensuring each one is handled by the right person, in the right location, using the right technology. Not every task needs AI. Some need AI with light human review. A smaller subset needs AI plus expert validation. A smaller subset still should never enter an automated pipeline at all. The discipline is in the mapping: asking, before any task is submitted, what it actually requires. A team that applies this thinking does not just consume fewer tokens. It produces better work, because scarce human judgment is applied where it creates value, not spread across everything by default. 
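The mapping step of disaggregation can also be sketched as one explicit decision per component task. This is a hypothetical illustration only: the Handling categories follow the four cases listed above, while the Task attributes (benefits_from_ai, needs_legal_judgment, excluded_from_automation) are assumed criteria standing in for whatever a team actually uses to decide.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    """The four handling modes described in the text."""
    NO_AI = "no model needed"
    AI_LIGHT_REVIEW = "AI output with light human review"
    AI_EXPERT_VALIDATION = "AI output validated by an expert"
    NEVER_AUTOMATED = "kept out of any automated pipeline"


@dataclass
class Task:
    name: str
    benefits_from_ai: bool          # hypothetical: would a model meaningfully help?
    needs_legal_judgment: bool      # hypothetical: does the output need expert sign-off?
    excluded_from_automation: bool  # hypothetical: e.g. a client or policy restriction


def assign(task: Task) -> Handling:
    """Map a single component task to a handling mode."""
    # Exclusion is checked first so nothing else can route the task into a model.
    if task.excluded_from_automation:
        return Handling.NEVER_AUTOMATED
    if not task.benefits_from_ai:
        return Handling.NO_AI
    if task.needs_legal_judgment:
        return Handling.AI_EXPERT_VALIDATION
    return Handling.AI_LIGHT_REVIEW


def disaggregate(tasks: list[Task]) -> dict[Handling, list[str]]:
    """Group task names by handling mode, including modes with no tasks."""
    plan: dict[Handling, list[str]] = {mode: [] for mode in Handling}
    for task in tasks:
        plan[assign(task)].append(task.name)
    return plan


plan = disaggregate([
    Task("update party names and dates", False, False, False),
    Task("first-pass clause check against playbook", True, False, False),
    Task("review non-standard indemnity", True, True, False),
    Task("restricted matter", False, True, True),
])
for mode, names in plan.items():
    print(mode.name, names)
```

Writing the mapping down this way makes the decision visible and reviewable before any task is submitted, which is the point of the discipline rather than the code itself.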

The real question for legal teams 

Token consumption will continue to rise as AI usage scales and model capabilities expand. But the organizations that manage this well will not simply be those with access to cheaper models. They will be the ones that have designed their systems to avoid spending tokens on work that does not need them.

Before asking what tokens cost, ask what your system is spending them on. Are you routing low-complexity work through high-cost pipelines? Are you loading the same full context into every request regardless of what the task requires? Is anyone — human or system — making an intelligent decision before the AI starts work? 

A legal AI system that churns tokens without discipline is not just a drain on resources. It is a sign of poor governance, which makes it a risk problem, and an expensive one at that.