Beyond the ambient scribe.
2026-03-20
Sarthi Editorial
The ambient scribe wave was easy to understand: it captured the note. Agentic AI is a different kind of system. It takes action — and that distinction reshapes what safety, economics, and adoption look like inside a specialty practice.
The last three years of clinical AI were, in retrospect, a narrow story. A single workflow — dictation of the clinical note — got picked up, compressed, and automated. The ambient scribe wave arrived, survived the novelty phase, and by 2025 had settled into being the second most common piece of AI in American practices, behind only voice-to-text.
That wave worked because the task was narrow and the output was passive. An ambient scribe captures what was said, renders it into a note, and hands it back for the clinician to edit and sign. Nothing happens in the world based on the scribe’s output until a human approves it. The safety profile is roughly the same as any other documentation tool.
Agentic AI is a different kind of system. It does not just describe. It acts. And the shift from describing to acting is the shift that reorganizes almost every question an operator needs to ask about AI inside a specialty practice.
What changes when the system acts
Three things change, and they change at once.
First, the unit of value changes. An ambient scribe’s value is measured per encounter: one note captured, some number of minutes saved at the keyboard. An agent’s value is measured per workflow: one prior authorization completed, one intake routed, one follow-up scheduled. The unit of value gets bigger, but so does the scope of what can go wrong.
Second, the failure mode changes. A scribe that gets a word wrong produces a note with a typo; a clinician catches it on signing. An agent that gets a step wrong produces an action in the world — a wrong code submitted, a wrong study ordered, a wrong patient scheduled. The cost of a silent failure is higher, which means the observability requirement is higher.
Third, the integration surface expands. A scribe integrates with one system: the EHR note. An agent has to integrate with whatever systems the workflow touches — the EHR, the payer portals, the scheduling module, the DME or device-clinic tooling, the fax inbox, the phone tree. It is a harder engineering problem, and it lives at the seams between systems that were never designed to be addressed programmatically.
The tier model we use internally
Not all agents are created equal, and one of the more useful exercises — both for vendors and for the practices evaluating them — is to be specific about how much autonomy a given system actually has. We think about this in four tiers:
Tier 1: Suggestion.
The agent proposes an action but takes none. A human reviews and either commits or discards. This is roughly where most ‘AI-assisted’ coding, charting, and drafting tools sit today. Safety profile: similar to an ambient scribe.
Tier 2: Low-stakes execution with post-hoc review.
The agent executes actions that are easily reversible and post-hoc reviewable. Scheduling a routine follow-up. Drafting a patient message and sending it through an existing template. The expected rate of material errors is low, and the cost of a reversal is small.
Tier 3: High-stakes execution with pre-commit review.
The agent prepares an action but waits for a human to commit it. Prior authorization submission is the canonical case: the agent can do 95% of the work, but a practice manager or clinician presses the submit button.
Tier 4: Direct clinical action.
The agent makes decisions with direct clinical consequences without human review in the immediate path. This is where the current generation of systems, including ours, does not operate. The regulatory and liability surface is not settled, and the engineering around safety guarantees is early.
Most of the agentic-AI conversation in 2026 happens in Tiers 2 and 3. Tier 4 is largely aspiration and marketing. The practical question for any practice is not “should we use agentic AI” but “which tier does a given workflow belong in, and does our vendor agree?”
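The tier question is easiest to enforce when it is written down as something a system can check rather than a phrase in a sales deck. A minimal sketch in Python — the tier names and the helper function are illustrative, not a real API:

```python
from enum import IntEnum

class AutonomyTier(IntEnum):
    """Hypothetical encoding of the four autonomy tiers described above."""
    SUGGESTION = 1         # propose only; a human commits or discards
    POST_HOC_REVIEW = 2    # execute reversible actions; review after the fact
    PRE_COMMIT_REVIEW = 3  # prepare the action; a human presses submit
    DIRECT_ACTION = 4      # act with no human in the immediate path

def requires_human_before_commit(tier: AutonomyTier) -> bool:
    """Tiers 1 and 3 block on a human; Tier 2 reviews afterward; Tier 4 never blocks."""
    return tier in (AutonomyTier.SUGGESTION, AutonomyTier.PRE_COMMIT_REVIEW)
```

A per-workflow mapping like this, agreed with the vendor in writing, is the artifact the rest of this piece argues for.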
Where agents actually earn their keep
In a specialty practice, the workflows where agentic AI changes the operational economics — rather than incrementally improving them — tend to share three characteristics.
- Multi-step and cross-system. The work requires reading one system, reasoning about it, and acting in another. Prior authorization, intake triage, device-clinic follow-up, and the revenue-cycle back office all share this shape.
- Rule-dense and payer-specific. The work depends on rules that change quarterly — payer policies, coding guidelines, documentation requirements. Humans have to hold these rules in their heads, and that holding is expensive.
- Interruption-heavy. The work consists of many short, interruptible tasks rather than a small number of long tasks. The cost of context-switching for a human is high, and the value of a system that can hold the context indefinitely is correspondingly high.
Ambient scribes addressed none of these. That is not a criticism of ambient scribes — they addressed a different problem — but it is the reason the value of the next wave is measured differently. A scribe saves minutes. An agent handles a workflow.
Where the real guardrails have to live
The hard question about agentic AI in a clinical setting is not whether the model is capable. By 2026 it is clear that the underlying models can execute these workflows. The hard question is how to ensure that when the model is wrong — and it will sometimes be wrong — the wrongness is caught before it propagates into the world.
Evaluation harnesses, not vibes
The version of AI safety that consists of a product manager spot-checking a few transcripts is not sufficient for a system that takes actions. Serious agentic systems in healthcare settings ship with continuous evaluation harnesses: known-answer test cases that run against every release, outcome-based metrics that monitor the system’s behavior in production, and regression tests tied to real failure modes that were previously observed.
The most important property of an evaluation harness is that it produces a number. Not a vibe about whether the system is “working well.” A measurable rate of false-positive and false-negative actions against a fixed benchmark, tracked release-over-release.
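The core of such a harness is small; the expensive part is curating the benchmark. A minimal sketch, assuming each known-answer case carries a ground-truth label for whether the agent should have acted (the `Case` shape here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Case:
    """One benchmark case: should the agent have acted, and did it?"""
    should_act: bool  # ground-truth label, fixed when the benchmark is curated
    did_act: bool     # what the system under test actually did on this release

def error_rates(cases: list[Case]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over a fixed benchmark."""
    negatives = [c for c in cases if not c.should_act]
    positives = [c for c in cases if c.should_act]
    fp = sum(c.did_act for c in negatives) / len(negatives)
    fn = sum(not c.did_act for c in positives) / len(positives)
    return fp, fn
```

Run against every release, this produces exactly the number the text calls for: a pair of rates that can be tracked release-over-release instead of a vibe.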
Typed tool interfaces, not free-text
The second guardrail is at the level of the interface the agent uses to act. An agent that writes free-form text into a system is difficult to constrain. An agent that calls typed, constrained tools — “submit an auth with these fields,” “schedule a visit of type X on date Y” — is much easier to reason about. The tool type becomes the contract; anything outside the contract is rejected.
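What a typed, constrained tool looks like in practice can be sketched with a validated dataclass — production systems would more likely use JSON Schema or a validation library, and the tool name and allowed values here are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

# The contract: visit types the scheduling tool will accept. Anything else is rejected.
VISIT_TYPES = {"follow_up", "device_check", "new_patient"}

@dataclass(frozen=True)
class ScheduleVisit:
    """A typed tool call; invalid arguments raise before anything reaches the EHR."""
    patient_id: str
    visit_type: str
    visit_date: date

    def __post_init__(self):
        if self.visit_type not in VISIT_TYPES:
            raise ValueError(f"visit_type {self.visit_type!r} is outside the contract")
```

The agent never writes free text into the scheduler; it can only construct a `ScheduleVisit`, and a malformed one fails at construction rather than downstream.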
Observability at the step level
Because agentic systems execute multi-step plans, debugging them after the fact requires being able to see the full plan, not just the final output. Every tool call, every model decision, every piece of context retrieved — all of it has to be inspectable. The practice’s compliance officer should be able to reconstruct exactly what the agent did, when, and why.
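The data structure behind that requirement is simple: an append-only trail where every step records the tool called, its arguments, and the model's stated rationale. A minimal sketch (class and field names are illustrative, not a real logging API):

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class AuditEntry:
    """One step of an agent run: what was called, with what, and why."""
    step: int
    tool: str
    arguments: dict
    rationale: str  # the model's stated reason for taking this step
    timestamp: float = field(default_factory=time.time)

class AuditLog:
    """Append-only, JSON-serializable trail; enough to replay a run step by step."""
    def __init__(self):
        self._entries: list[AuditEntry] = []

    def record(self, tool: str, arguments: dict, rationale: str) -> None:
        self._entries.append(AuditEntry(len(self._entries), tool, arguments, rationale))

    def export(self) -> str:
        return json.dumps([asdict(e) for e in self._entries], indent=2)
```

An export like this — ordered, timestamped, with rationale attached to every step — is what lets a compliance officer reconstruct a specific decision weeks later.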
Clear hand-offs to humans
The design of the hand-off points — where the agent pauses and waits for human review — is at least as important as the model’s capability. An agent that does 90% of the work and surfaces the right 10% for review at the right moment is far more useful than an agent that does 100% of the work and surfaces nothing, or one that does 50% and demands constant attention.
What this means for a practice choosing a vendor
Most of the agentic-AI marketing in healthcare today overstates autonomy and understates scope. The practical questions a practice should ask, in rough order of priority:
- Which tier? For each workflow the vendor claims to handle, get clarity on whether the system suggests, executes with post-hoc review, executes with pre-commit review, or acts directly. Get it in writing.
- Where does the hand-off live? A system that pauses at the right moment is a better system than one that runs unattended. Ask to see the hand-off in a real environment, not a demo.
- What is the evaluation harness? Ask for the current false-positive and false-negative rates on the workflows you care about. A vendor that cannot produce these numbers does not have them.
- What does the audit log look like? Ask to see it. If the log does not let your compliance officer reconstruct a specific decision weeks after the fact, that is a material problem.
- What happens when the vendor is wrong? BAA, liability allocation, and the remediation pathway. These questions are less fun than the demo but more important.
The direction
The ambient scribe wave captured the note. The agentic wave is capturing the workflow. Both are real, and both are durable. The practices that do well in the next few years will be the ones that treat agentic AI not as a bolt-on to the existing process — a way to “save five minutes here and there” — but as a replatforming of the administrative half of the practice around a workforce that can hold more context, work more hours, and follow more rules than the one they currently employ.
The ones that do poorly will buy “agentic” systems that turn out to be Tier 1 tools in heavier packaging, pay for them at Tier 4 prices, and wonder in two years why the administrative tax on their practice did not actually fall.
The gap between the two outcomes is almost entirely a question of how carefully the practice specified what tier of autonomy it was buying, and how honestly the vendor answered.
See an agentic workflow running on a practice like yours.
We’ll walk through how Sarthi handles a prior authorization end-to-end — the tool calls, the hand-off points, the audit log — on a workflow and payer mix that looks like yours.
Book a walkthrough