AI agents in the wet lab: what actually saves time, what's hype, and where they break

The “AI in your lab” pitch comes in two extreme flavors.

On one end: fully autonomous self-driving labs — robots designing, executing, and analyzing experiments while the PI takes a vacation. On the other: a friendly chatbot that helps you draft an email about your western blot.

Neither extreme matches what works in mid-2026. The honest answer sits in between — and it’s narrower, more interesting, and more useful than either pole. This article maps that middle ground: what agents actually do well today in a wet lab, where they still fumble, and the data architecture that decides which side of that line your lab ends up on.

If you’re a PI being pitched on “AI for your lab,” this is meant to be the article you read before signing anything.

Where the hype lands vs. where the work is

Two industry signals will skew your expectations.

Demo videos of self-driving labs. Coscientist, ChemCrow, and their successors are real systems that solve real problems — but the problems they solve are narrowly scoped: single well, single readout, well-defined chemistry. They don’t generalize to most cell-biology workflows, no matter what the press release implies.

Generic LLM chatbots branded as lab assistants. Useful for summarizing protocols you paste in. The moment you ask one to read your actual experimental data or call your actual instruments, it falls back to plausible-sounding hallucination.

The agents that earn their place in a working lab sit between these two. They:

read real instrument streams, not generic published values;
call specific tools — cell counters, image-analysis pipelines, scheduling APIs;
operate on a bounded scope — this experiment, this cell line, this reagent set;
escalate to humans when they hit safety, novelty, or cost boundaries.

Everything else in 2026 is either a research paper or a sales deck.
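Two of the properties above, calling specific tools and escalating at boundaries, can be sketched in a few lines of Python. This is a minimal illustration, not any framework's API: the `tool` decorator, the `dispatch` function, and both example tools are hypothetical, and the instrument call returns canned data.

```python
import json

TOOLS = {}  # tool name -> (function, needs_sign_off)

def tool(needs_sign_off=False):
    """Register a callable the agent may invoke by name."""
    def wrap(fn):
        TOOLS[fn.__name__] = (fn, needs_sign_off)
        return fn
    return wrap

@tool()
def count_cells(well):
    # Stand-in for a real cell-counter call; returns canned data here.
    return {"well": well, "cells_per_ml": 1.2e6}

@tool(needs_sign_off=True)
def order_reagent(name, cost_usd):
    # Spending money crosses a cost boundary, so this tool is gated.
    return {"ordered": name, "cost_usd": cost_usd}

def dispatch(model_output, approved_by=None):
    """Run one JSON tool call emitted by the model, or escalate it."""
    call = json.loads(model_output)
    name = call.get("tool")
    if name not in TOOLS:
        # Out of scope: refuse rather than let the model improvise.
        return {"status": "refused", "reason": f"unknown tool: {name}"}
    fn, gated = TOOLS[name]
    if gated and approved_by is None:
        return {"status": "escalated", "reason": f"{name} needs human sign-off"}
    return {"status": "ok", "result": fn(**call.get("args", {}))}
```

The design choice worth noting is that the gate lives in the dispatcher, outside the model: the model can ask for a gated tool, but only a human-supplied `approved_by` value lets the call through.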

[Figure: five wet-lab tasks compared, three that agents do well today and two that still require human review before execution.] The 2026 split: three places to deploy agents now; two places they’ll embarrass you if you don’t review their output.

Three jobs agents already do well

1. Protocol drafting + adaptation

Give a well-instrumented agent a published method and your lab’s specific cell line, media formulation, and available equipment, and it will produce a serviceable first-draft SOP in seconds.

“Adapt this NIH iPSC differentiation protocol to our HUES9 line cultured in mTeSR Plus on a CultureON 100, with stocks of XYZ growth factors and an Operetta CLS for endpoint imaging.”

A well-grounded agent produces a step-by-step protocol that already accounts for your reagent concentrations, your gas mix, your timing constraints, and your readout. A junior trainee edits it in 10 minutes; drafting it from scratch takes an hour. The agent has done the boilerplate; the human has done the judgment.

This is the lowest-risk place to deploy an agent today. The output is a document, edited by a human, never executed unsupervised. Errors get caught in the editing pass.

2. Anomaly triage on instrument streams

If your incubator, perfusion pump, or microscope is streaming telemetry (see our earlier piece on real-time cell culture telemetry), an agent can watch the stream and surface deviations with context.

Not “alert me whenever CO₂ drifts out of range” — that’s a threshold rule, not an agent. The agent’s value is the layer above:

The CO₂ deviation at 02:14 matches a door-event signature. The affected run is your iPSC expansion in well B-04. Recovery took 11 min 22 s, 40% longer than baseline for this unit. The cells likely saw nine minutes outside the buffering range; expect elevated stress-response gene expression on the next harvest.

The agent did three things a threshold rule can’t: it classified the event (door open vs. sensor fault vs. HVAC), scoped it to the affected experiment, and quantified the likely biological consequence.
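The gap between a threshold rule and the triage layer can be made concrete with a small Python sketch. Everything in it is an illustrative assumption: the `Sample` record, the 4.5 to 5.5 % CO₂ band, the door-sensor flag, the 60-second look-back window, and the classification heuristic are hypothetical and are not validated thresholds or a vendor API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float          # seconds since the start of the trace
    co2: float        # CO2 reading, in percent
    door_open: bool   # hypothetical door-sensor flag

LOW, HIGH = 4.5, 5.5  # assumed acceptable CO2 band, in percent

def in_band(s):
    return LOW <= s.co2 <= HIGH

def threshold_rule(samples):
    """The plain alert: True if any reading leaves the band."""
    return any(not in_band(s) for s in samples)

def triage(samples, baseline_recovery_s):
    """Classify an excursion and compare its recovery time to baseline.

    Assumes samples are sorted by time and contain at most one excursion.
    """
    out = [s for s in samples if not in_band(s)]
    if not out:
        return None
    start, end = out[0].t, out[-1].t
    # Heuristic: a door flag in the 60 s before the excursion suggests a door event.
    door_seen = any(s.door_open for s in samples if start - 60 <= s.t <= start)
    # First sample after the last out-of-band reading; if there is none,
    # the trace ended before recovery and we report the time elapsed so far.
    recovered = next((s.t for s in samples if s.t > end), end)
    recovery_s = recovered - start
    return {
        "cause": "door_event" if door_seen else "unclassified",
        "recovery_s": recovery_s,
        "pct_over_baseline": round(100 * (recovery_s / baseline_recovery_s - 1), 1),
    }
```

`threshold_rule` answers only "did something happen?". `triage` adds a probable cause and a recovery time relative to this unit's history, which is the part that needs stored baselines and a second sensor stream.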

This only works if the data is queryable — which is why OMĒOS treats real-time telemetry and document records as the substrate the agent layer reads from. Agents without a data layer are confidently wrong.
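As a rough illustration of what "queryable" buys you, the sketch below stores classified events in SQLite and asks how often the same kind of event has recurred on a unit at the same hour. The table layout, unit names, and event labels are hypothetical; this is not the OMĒOS schema.

```python
import sqlite3

def open_memory(path=":memory:"):
    """Open (or create) the event log. Pass a file path to persist across sessions."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events (unit TEXT, kind TEXT, ts TEXT)"
    )
    return db

def record(db, unit, kind, ts):
    """Store one classified event; ts is an ISO 8601 string such as 2026-05-01T02:14:00."""
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (unit, kind, ts))
    db.commit()

def occurrences(db, unit, kind, hour):
    """Count events of this kind on this unit that fell within the given hour of day."""
    row = db.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE unit = ? AND kind = ? AND substr(ts, 12, 2) = ?",
        (unit, kind, hour),
    ).fetchone()
    return row[0]
```

With a history like this, an agent can say that a door event is the third of its kind on one unit rather than treating each alert as new. Matching on the hour string is deliberately crude; a real system would need a proper time-window query.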

3. Pipeline composition for image analysis

Given an image set and a question, an agent can chain segmentation, tracking, and classification tools — Cellpose + TrackMate + CellProfiler Analyst, or any of the alternatives we mapped in the open-source cellular image analysis field map — into a working pipeline.

What this looks like in practice:

Count mitotic events in these 96 wells over the last 24 hours, broken out by treatment group. Flag wells where the rate is more than 2σ above baseline.

The agent picks the right segmenter (Cellpose for irregular cells, StarDist for nuclei), chains it to TrackMate for lineage, runs a classifier for mitosis detection, and emits a CSV with per-well counts. A junior analyst could do this. The agent does it in four minutes instead of four hours.

The pipeline won’t be publication-grade out of the box — you’ll need to review the segmentation, validate the classifier, sanity-check edge wells. But it gets you to a draft answer fast enough to iterate, which is where most of the wall-clock time on an analysis goes anyway.
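The last step of that request, flagging wells more than 2σ above baseline, is simple enough to sketch in Python. This assumes the segmentation, tracking, and classification tools have already produced a per-well mitotic rate, and it assumes "baseline" means the mean and sample standard deviation of a set of control wells; both the `flag_wells` function and that definition of baseline are illustrative choices, not the pipeline described above.

```python
from statistics import mean, stdev

def flag_wells(rates, control_wells, n_sigma=2.0):
    """Return well IDs whose rate exceeds baseline mean + n_sigma * baseline SD.

    rates: dict mapping well ID -> mitotic events per cell over the window
    control_wells: well IDs treated as baseline (needs at least two)
    """
    base = [rates[w] for w in control_wells]
    cutoff = mean(base) + n_sigma * stdev(base)  # stdev is the sample SD
    return sorted(w for w, r in rates.items() if r > cutoff)
```

With only a handful of control wells the standard deviation is itself noisy, which is one more reason the flagged list is a starting point for review rather than a result.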

Two jobs they still fumble

1. Long-horizon experimental reasoning

Designing a three-month experiment with branching decision points based on intermediate results is not an agent task in 2026. LLMs lose track of earlier constraints over long horizons, and they have no calibrated model of biological uncertainty such as cell-doubling-time variance, batch effects, or contamination probability. The result is a decision path that reads as plausible but does not account for how often real cultures deviate from plan.

The “AI Scientist” papers — autonomous experimental design + execution + analysis — work because they’re constrained to extremely narrow problem spaces. Try to scale them to a full thesis project and they fall apart.

If you’re considering an agent for experimental design, treat it as a brainstorm partner, not a decision-maker. Use it to generate hypotheses you wouldn’t have thought of. Don’t let it pick which ones to run.

2. Safety-critical and biosafety decisions

Agents will, with high confidence, propose protocols that:

use deprecated reagents that have been off-market for years;
specify concentrations outside the working range for the cell line;
combine incompatible buffers;
violate BSL-2 containment without flagging it.

Every protocol an agent drafts must be reviewed by someone who knows the field before it touches a flask. “The agent suggested it” will not survive an IBC review. It will not survive a publication’s reproducibility check. It will not survive a wrongful-death lawsuit.

This is not a problem agents can solve themselves — they have no skin in the game. Keep humans in the loop where the cost of being wrong is high.

The architecture that separates a useful agent from a parlor trick

When you’re being pitched on an agent product, five questions decide whether the thing in front of you is a tool or a demo. We use this list internally; share it with any vendor that walks into your lab.

[Figure: the five layers that separate a useful agent from a parlor trick: tool use, persistent memory, ground truth from real instrument streams, a verification layer, and explicit boundaries.] The five-question test. A product that doesn’t have answers to all five is a demo, not a deployment.

1. Tool use. Does the agent call actual instruments, pipelines, and databases — or does it just generate text? “Generates text” is a chatbot, not an agent. Frameworks to look for: MCP (Anthropic’s open Model Context Protocol), structured function-calling APIs, LangGraph-style tool nodes.

2. Persistent memory. Does the agent remember what happened in your lab last week, last month, last year, or does it start every conversation from scratch? Memory is what lets the agent recognize patterns: this is the third time we’ve seen the 02:14 door event on this unit.

3. Ground truth from real instrument streams. Is the agent reading your actual data, or a published spec sheet it has interpolated around? An agent that hallucinates data is worse than no agent at all, because it’s confident.

4. Verification layer. Does it check its own output against known constraints before emitting it? A draft protocol should be validated against your reagent inventory, the cell line’s working ranges, and your biosafety classification before it appears in front of a human. Otherwise the human is doing the agent’s verification work — and probably missing things the agent should have caught.

5. Explicit boundaries. What does the agent refuse to do without human sign-off? A vendor that can’t answer this in detail hasn’t built the safety layer yet — they’ve shipped a demo with a robot personality.

A product without all five is a parlor trick.
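Of the five, the verification layer in question 4 is the easiest to picture as code. The Python sketch below checks a draft protocol against a reagent inventory, per-reagent working ranges, and the lab's biosafety level before a human sees it. The step format, the `verify_protocol` function, and the placeholder reagent names are all hypothetical; real working ranges would have to come from your own validated records.

```python
def verify_protocol(steps, inventory, working_ranges, lab_bsl):
    """Return a list of human-readable problems; an empty list means no check failed.

    steps: list of dicts with "reagent", "conc", and optionally "bsl"
    inventory: set of reagent names currently in stock
    working_ranges: dict mapping reagent -> (low, high) in the same units as "conc"
    lab_bsl: the biosafety level the lab is approved for
    """
    problems = []
    for i, step in enumerate(steps, start=1):
        reagent = step["reagent"]
        if reagent not in inventory:
            problems.append(f"step {i}: {reagent} not in inventory")
        lo, hi = working_ranges.get(reagent, (None, None))
        if lo is not None and not (lo <= step["conc"] <= hi):
            problems.append(f"step {i}: {reagent} at {step['conc']} outside {lo}-{hi}")
        if step.get("bsl", 1) > lab_bsl:
            problems.append(f"step {i}: requires BSL-{step['bsl']}, lab is BSL-{lab_bsl}")
    return problems
```

Passing these checks does not make a protocol safe; it only means the reviewer's time goes to judgment calls rather than to catching an out-of-stock reagent.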

What this means for your lab today

Four pieces of practical guidance, in priority order:

1. Don’t deploy agents without instrumentation. They need real data to be useful. An incubator that doesn’t stream is an incubator the agent will hallucinate around.
2. Start with bounded, verifiable tasks. Protocol drafting first: the output is a document, edited by a human. Anomaly triage second: the output is an alert, validated by a human. Don’t jump to autonomous experimentation.
3. Keep humans in the loop on safety, novel design, and publication. These are the categories where being wrong is expensive, whether financially, scientifically, or legally.
4. Invest in your data layer. This is the real unlock. Data infrastructure is reusable across many agent use cases; the agent itself is interchangeable. Build the substrate well and you can swap agent vendors as the field evolves without rebuilding everything around them.

How 37degrees thinks about agents

We don’t ship an “AI agent” today. We ship the data layer the agent needs to be useful: continuous instrument streams, document records, a live database, and the cross-experiment context that makes the agent’s answers grounded instead of guessed.

Every CultureON 100 streams telemetry to OMĒOS by default. Every experiment record is linked to the instrument traces that ran on it. Every document record sits next to the data it was generated from. That’s the substrate.

When you deploy an agent on top, ours or someone else’s, it has real data to read. The five-question test above is how we evaluate every agent product partnering with us. The ones that pass can answer all five questions; the ones that can’t are demos.

This is also why the 37degrees technology stack lists AI Agents alongside HPC GPU compute, document records, live database, and social sharing as a first-class capability of OMĒOS. The agent is one consumer of the data layer. It is not the foundation, and it shouldn’t be — agents are interchangeable; the data underneath is not.

Closing: collaborators, not autopilots

The honest 2026 framing is this. Agents are excellent at compression — turning hours of work into minutes — when the task is bounded and verifiable. They are terrible at expansion — taking a half-formed hypothesis into a published result without supervision.

Treat them like a fast, occasionally hallucinating, very motivated research assistant who needs your judgment.

The labs that win in the next two to three years won’t be the ones that bet on full autonomy. They’ll be the ones that deploy bounded agents on a well-instrumented foundation, free up their scientists to do the parts of science that compress poorly — curiosity, judgment, hypothesis selection — and leave the rest to the agents.

Build the data layer first. The agents take care of themselves.

Working on lab automation and want the data infrastructure to match? Explore OMĒOS, the streaming + document-records substrate agents need to be useful — or get in touch if you’d like to compare notes on what’s actually shipping in 2026.