
A recent Sense Collective roundtable brought together legal operations and legal innovation leaders to examine a deceptively simple assumption: human oversight of AI is required.
The discussion quickly moved past the binary question of whether humans should be “in the loop” and toward a more practical question: what kind of oversight is appropriate for which workflows, at what risk level, and with what audit model?
Factor’s 2026 GenAI in Legal Benchmarking data shows why this question is moving from theory to operating model: only 22% of legal teams report high trust in AI outputs, while roughly 70% say outputs still require targeted edits or extensive rework before they can be relied upon. The issue is no longer whether humans should oversee AI. It is how oversight should be designed so that review effort matches the risk of the work.
Members discussed concrete use cases, emerging frameworks for categorizing oversight levels, unauthorized practice of law concerns, and a provocative inversion: if AI demonstrably outperforms humans in certain tasks, could failing to use it eventually become the greater professional risk?
Below are key highlights from the discussion, summarized without attribution to any individual or organization.
“In the loop,” “on the loop,” and “out of the loop” can all be legitimate oversight models depending on the task. The goal is not maximum oversight. The goal is right-sized oversight.
The dominant framework that emerged was a three-part test for determining how much human involvement a workflow requires:
Volume: how often does the work recur? Complexity: how much judgment does it demand? Risk: what is the liability exposure if something goes wrong?
High-volume, low-complexity, low-risk tasks were consistently named as strong candidates for reducing per-instance human review. Examples included routine NDA triage, certain RFP-related agreements, and templated conflict-of-interest reviews where historical decision patterns are highly consistent.
As complexity increases, however, members were clear that human involvement remains essential. Non-standard terms, unusual risk allocation, and strategic legal judgment all require a qualified person in the decision-making process, even if that person is validating an AI recommendation rather than doing the work from scratch.
One example discussed was NDA triage: a tool may be able to clear routine NDAs with no unusual provisions, flag a middle category for light review, and escalate the small percentage containing genuine red flags. Another example involved conflict-of-interest workflows where historical data showed that certain categories were almost always approved, suggesting that periodic audit may add more value than repetitive per-instance sign-off.
The point is not that every team should adopt the same thresholds. The point is that oversight should be calibrated to the actual risk and complexity of the workflow.
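As a rough illustration, the three-part test could be expressed as a small decision rule that maps a workflow to one of the three oversight models named above. This is a minimal sketch, not a recommended policy: the `Workflow` fields, the 1-to-5 scoring scales, the `well_tested` flag, and every threshold are hypothetical and would need to be set by each team from its own data.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    IN_THE_LOOP = "a qualified person reviews every output"
    ON_THE_LOOP = "a human monitors performance and audits samples"
    OUT_OF_THE_LOOP = "the system acts without per-instance review"


@dataclass(frozen=True)
class Workflow:
    name: str
    monthly_volume: int  # how often the work recurs
    complexity: int      # hypothetical scale: 1 (templated) to 5 (strategic judgment)
    risk: int            # hypothetical scale: 1 (negligible exposure) to 5 (severe)
    well_tested: bool    # has the automation been validated on this workflow?


# Illustrative thresholds only; each team would calibrate its own.
HIGH_VOLUME = 100
MAX_LOW_SCORE = 2


def recommend_oversight(wf: Workflow) -> Oversight:
    """Map volume, complexity, and risk to an oversight model."""
    # Anything complex, risky, or infrequent keeps per-instance review.
    if (wf.complexity > MAX_LOW_SCORE
            or wf.risk > MAX_LOW_SCORE
            or wf.monthly_volume < HIGH_VOLUME):
        return Oversight.IN_THE_LOOP
    # Only the narrowest, best-tested cases run without per-instance review.
    if wf.well_tested and wf.complexity == 1 and wf.risk == 1:
        return Oversight.OUT_OF_THE_LOOP
    # High-volume, low-complexity, low-risk work defaults to audit-based oversight.
    return Oversight.ON_THE_LOOP


if __name__ == "__main__":
    for wf in (
        Workflow("routine NDA triage", 400, 1, 1, True),
        Workflow("templated conflict review", 250, 2, 1, True),
        Workflow("bespoke indemnity negotiation", 6, 5, 5, False),
    ):
        print(f"{wf.name}: {recommend_oversight(wf).name}")
```

The rule deliberately fails safe: a workflow has to clear every condition before per-instance review is reduced, and uncertainty on any one dimension keeps a person in the decision.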
The most practical shift members described was not from human review to full autonomy. It was from human in the loop to human on the loop.
Human in the loop means a qualified person reviews every AI output before it is acted upon.
Human on the loop means the system can run autonomously within defined parameters, while a human monitors performance, audits outputs, and can intervene when needed.
Human out of the loop means the system acts without human involvement. Members treated this as appropriate only in narrow, well-tested, low-risk, high-volume circumstances.
This distinction matters because “human oversight” does not always have to mean reviewing every single output. In the right workflows, statistical sampling and periodic auditing may be a more realistic and more effective quality-control model.
Members described evolving toward audit-based approaches in which AI handles routine work, while humans validate quality over time through sampling. In one example, a team reviewed a sample of outputs on an ongoing basis rather than reviewing every instance, using the sample to confirm that the tool remained within acceptable accuracy bounds.
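A sample-based audit of this kind might be sketched as follows. The roundtable did not describe a specific statistical method, so the use of a Wilson score lower bound here is an assumption, as are the function names, the sample size, and the accuracy threshold. The idea is to draw a random sample of outputs, have a reviewer mark each one correct or incorrect, and check whether the accuracy that the sample supports stays above an agreed floor.

```python
import math
import random


def draw_audit_sample(output_ids, sample_size, seed=None):
    """Pick a simple random sample of output IDs for human review."""
    ids = list(output_ids)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, min(sample_size, len(ids)))


def wilson_lower_bound(correct, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n == 0:
        return 0.0
    p = correct / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def within_accuracy_bounds(correct, n, min_accuracy):
    """True if the sample supports accuracy at or above the agreed floor."""
    return wilson_lower_bound(correct, n) >= min_accuracy


if __name__ == "__main__":
    sample = draw_audit_sample(range(1, 1001), sample_size=100, seed=7)
    print(f"Reviewing {len(sample)} of 1000 outputs")
    # Suppose reviewers found 98 of the 100 sampled outputs correct.
    print(f"Lower bound: {wilson_lower_bound(98, 100):.3f}")
    print("Within bounds:", within_accuracy_bounds(98, 100, min_accuracy=0.90))
```

Using a lower confidence bound rather than the raw sample accuracy is a conservative choice: a small sample with no observed errors does not by itself show that the tool clears a high threshold.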
The key precondition is evidence. Teams with years of prior decisions logged in workflow systems are in a stronger position to identify which decisions are highly consistent, which ones require judgment, and where automation can be defended.
If humans were approving the same category of request almost every time, the human review may have been adding less value than assumed. In that context, AI plus periodic audit may create a more consistent, documented, and auditable process.
One of the most important challenges raised in the discussion was the idea of “ground truth.”
Many AI validation processes compare model output against attorney-reviewed work. But that assumes the attorney answer is always right. Members questioned that assumption.
In one validation exercise discussed, attorney reviewers were not always correct. In some cases, the AI identified the right answer and the attorney-reviewed baseline contained the error. That does not mean AI should be treated as infallible. It means humans should not be treated as infallible either.
This reframes the oversight question. The right question is not, “Is the AI perfect?” The better question is, “What level of accuracy is acceptable for this task, given its risk, and how does the AI-human system perform compared with the current human-only process?”
That distinction is important for legal teams building validation frameworks. If attorney output is treated as perfect ground truth, the validation process may mask real performance gaps, including human inconsistency, playbook interpretation differences, and fatigue-driven error.
A more mature approach is to define task-specific accuracy thresholds, test against those thresholds, audit periodically, and account honestly for both AI error and human error.
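One way to account for both kinds of error is to score the AI output and the attorney baseline against a separately adjudicated answer, instead of scoring the AI against the attorney alone. The sketch below assumes such an adjudicated answer exists (for example, from a second-reviewer panel), which is itself an assumption and not an infallible source. The `ReviewedItem` structure, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedItem:
    item_id: str
    ai_answer: str
    attorney_answer: str
    adjudicated_answer: str  # assumed independent adjudication, not infallible


def compare_against_adjudication(items):
    """Report AI and attorney accuracy against the adjudicated answers."""
    n = len(items)
    if n == 0:
        raise ValueError("no reviewed items to compare")
    ai_correct = sum(i.ai_answer == i.adjudicated_answer for i in items)
    attorney_correct = sum(i.attorney_answer == i.adjudicated_answer for i in items)
    # Cases a naive "attorney = ground truth" comparison would score as AI errors.
    ai_right_attorney_wrong = [
        i.item_id
        for i in items
        if i.ai_answer == i.adjudicated_answer
        and i.attorney_answer != i.adjudicated_answer
    ]
    return {
        "ai_accuracy": ai_correct / n,
        "attorney_accuracy": attorney_correct / n,
        "ai_right_attorney_wrong": ai_right_attorney_wrong,
    }


def meets_threshold(report, task_threshold):
    """Check the AI result against a task-specific accuracy threshold."""
    return report["ai_accuracy"] >= task_threshold


if __name__ == "__main__":
    items = [
        ReviewedItem("nda-001", "approve", "approve", "approve"),
        ReviewedItem("nda-002", "escalate", "approve", "escalate"),
        ReviewedItem("nda-003", "approve", "escalate", "escalate"),
        ReviewedItem("nda-004", "approve", "approve", "approve"),
    ]
    report = compare_against_adjudication(items)
    print(report)
    print("Meets 0.9 threshold:", meets_threshold(report, 0.9))
```

Reporting the two accuracy figures side by side makes the comparison in the text concrete: the question becomes how the AI-human system performs relative to the human-only process, at a threshold chosen for the task.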
Members also raised a practical governance tension: legal teams often advise the business that AI outputs require human review, while legal teams themselves are exploring ways to reduce human review in their own workflows.
That inconsistency can become a credibility risk.
If a legal AI policy says that all AI-generated outputs must be reviewed by a human, the legal team needs a principled explanation for why exceptions inside legal are justified. Otherwise, the better answer may be to revise the policy.
One possible distinction is user population. A trained legal professional using a well-scoped tool inside a controlled workflow is not in the same position as a non-expert using a general-purpose AI tool to answer a legal question. Governance should reflect that difference.
A mature AI policy should distinguish between experts and non-experts, high-risk and low-risk workflows, internal and external use cases, and systems that are auditable versus systems that are not. One policy does not fit all users or all tasks equally.
As automation scales, the risk of external challenge grows. Several members raised unauthorized practice of law (UPL) as a risk that should shape how teams think about automation and human oversight thresholds.
The concern operates on two levels.
First, for legal teams themselves, how much attorney involvement is required to say that an AI-generated output reflects attorney judgment? Is it enough for an attorney to design the workflow or prompt? Does an attorney need to review every output? Can sample-based review be sufficient? These remain open questions.
Second, for enterprise tools more broadly, employees can now prompt general-purpose AI tools and receive something that looks like legal advice. Several members described this as already difficult to control. Guardrails such as routing certain legal prompt categories to legal intake, rather than allowing the tool to answer directly, may become increasingly important.
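A routing guardrail of that kind might look something like the sketch below. It is deliberately crude: the category names, the keyword patterns, and the `route_prompt` function are all hypothetical, and a real deployment would likely need a more robust classifier than keyword matching. The point is only to show the shape of the control, in which prompts falling into designated legal categories are sent to legal intake instead of being answered directly.

```python
import re

# Hypothetical categories and patterns; a real list would come from the legal team.
LEGAL_INTAKE_PATTERNS = {
    "contract": re.compile(r"\b(contract|nda|indemnif\w*|terminat\w*)\b", re.IGNORECASE),
    "employment": re.compile(r"\b(dismiss\w*|discriminat\w*|harass\w*)\b", re.IGNORECASE),
    "litigation": re.compile(r"\b(lawsuit|subpoena|sue|sued|litigat\w*)\b", re.IGNORECASE),
}


def route_prompt(prompt):
    """Return (destination, category) for an employee prompt.

    Prompts matching a designated legal category go to legal intake;
    everything else may be answered by the tool directly.
    """
    for category, pattern in LEGAL_INTAKE_PATTERNS.items():
        if pattern.search(prompt):
            return ("legal_intake", category)
    return ("answer_directly", None)


if __name__ == "__main__":
    for prompt in (
        "Can we terminate this vendor contract early?",
        "We received a subpoena, what should we do?",
        "Summarize the agenda for Thursday's meeting.",
    ):
        print(route_prompt(prompt), "<-", prompt)
```

Keyword matching will produce both false positives and false negatives, so a guardrail like this would itself be a candidate for the sampling and audit practices discussed earlier.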
Members noted that UPL concerns remain largely theoretical in this context. No one described a live litigation or enforcement event. But the group saw value in proactively understanding the risk rather than waiting for a challenge to emerge.
One useful reframing was that UPL rules were originally designed to protect the public from unqualified practitioners, not to protect law firms’ revenue streams. That distinction suggests that public-facing AI legal applications may carry greater UPL risk than controlled internal automation for sophisticated legal teams.
The practical implication: legal teams should understand where UPL questions may arise, document the role of attorney judgment, and design audit trails before those questions are raised externally.
Defensibility is not only a legal question. It is also an architecture question.
Members emphasized that responsible automation depends on closed-context design, clear audit trails, traceable outputs, and user education. Automated workflows should have access only to the data they need. Decisions should be explainable and reviewable. Users should understand what the system can access, what it cannot access, and when human judgment is required.
This is especially important as agentic tools become more common. The more a system can retrieve data, use tools, and take action, the more important it becomes to define boundaries, monitor behavior, and educate users.
Responsible automation is not just about deploying a tool. It is about building an operating model that can withstand scrutiny.
Perhaps the most provocative theme in the discussion was whether the risk will eventually invert.
Today, much of the concern around AI focuses on the danger of using it: hallucinations, missed issues, poor judgment, confidentiality concerns, and lack of explainability.
But several members raised the opposite question: what happens when AI reliably finds issues that humans miss?
If an AI tool can identify buried contract terms, regulatory triggers, obligation patterns, or risk signals more consistently than a human reviewer, then failing to use that tool may eventually become the greater professional risk.
The group connected this to the evolution of legal research tools. Technologies that were once optional can become part of the expected standard of practice. AI may follow a similar path.
That does not mean every AI tool should be adopted immediately or uncritically. It does mean legal teams should begin tracking where AI materially improves performance, where it remains unreliable, and when the professional baseline starts to move.
The emerging question is no longer just, “Can we justify using AI?” It is also, “In which workflows will we eventually need to justify not using it?”
The Sense Collective is Factor’s curated community for legal and innovation leaders advancing AI in legal.