Morpheus — Reading the Chat

One human is texting. There should be one coach reading the thread and replying once — not N parallel agents racing each other. Every idea we have for getting there.


2026-06-18 · companion to imessage/chat-poller.mjs + src/agent.mjs · the message-batching / turn-taking problem

The problem, precisely

iMessage is bursty. A human rarely sends one tidy paragraph — they fire a stutter of bubbles as the thought assembles:

them: wait

them: so i just had like 3 slices

them: of pizza

them: is that bad lol

The naive transport — one inbound row → one agent turn → one reply — turns that into four overlapping turns. Four LLM calls, four replies, each one blind to the bubbles that came after it. The coach answers "wait" before "pizza" has even arrived. That's the multi-thread failure: the user sees a stranger answering each fragment, not one person reading the message.

We want the opposite: collapse the burst into one coherent read, and reply once — but without making a person who sends a single quick "what should I eat?" sit through a fixed dead-air delay before the coach reacts. Those two goals pull against each other. That tension is the whole design space below.

Two distinct sub-problems, often conflated. (1) Coalescing — bubbles that arrive before we start generating. (2) Barge-in — bubbles that arrive while we're generating or already mid-reply. A debounce window only solves (1). A complete solution needs both.

What we have today

Burst-settling in the poller. In chat-poller.mjs, each chat has a settle buffer. A new row resets a settleMs (currently 8s) quiet-window timer; only after the window goes silent do the buffered rows fire as ONE onMessage, text joined by newlines. This is a fixed debounce.
Per-chat serialization. An inflight map chains turns per chat so two batches never run in parallel. Good, but a batch that lands mid-turn just waits its turn; it can't fold into the running one or interrupt it.
Full history into the model. runAgent replays user.history, so the model can already see prior bubbles. The problem is purely about when we decide the user is done and how late input is handled.
Paced multi-bubble output. deliverBubbles already sends one reply as several bubbles with typing cadence, so a single batched turn still looks like natural texting. We don't need many turns to get many bubbles.

The one knob in production is the fixed 8s settle. It's the "annoying delay." Everything below is about beating a fixed delay.

The three axes

Every idea is a choice on three axes. Naming them keeps the catalog honest:

Axis	Question	Cheap end ↔ expensive end
Latency	How long until the coach reacts to a finished thought?	instant ↔ fixed dead-air
Coherence	Does one reply address the whole burst?	per-fragment ↔ whole-thread
Cost	Tokens / compute burned per human turn?	one call ↔ speculative re-runs

A fixed debounce trades latency for coherence at zero extra cost. Speculation buys latency back by spending cost. Model-in-the-loop buys endpointing accuracy by spending a small extra call. Pick according to whichever you're short on.

Signals we can read

Smart timing needs evidence that the user is (or isn't) done. What's actually available to us:

Signal	Source	Tells us
Typing indicator	chat.db / helper stream (the "…" bubble)	The strongest signal. If they're typing, they're not done. Fire the instant typing stops, not on a blind timer.
Terminal punctuation	message text	Ends with `? . !` → likely complete. Ends bare, or on `and / so / i / ,` → likely mid-thought.
Inter-bubble gap	row timestamps	Tight gaps (<2s) = a burst in progress. A long gap = thought finished.
Message shape	text length / form	A lone "?" or "ok" is complete-and-short → fire fast. A trailing fragment wants a wait.
Question vs statement	content	A direct question expects an immediate answer; rambling context can tolerate a beat.
Attachment rows	`has_attachments`	A photo often lands as its own row, with the caption a separate bubble. Almost always wait for the pair.
Read receipts / our own state	helper	Whether we've already marked read / started typing back — affects whether interrupting is graceful.

The typing indicator is the unlock. Most of the "annoying delay" problem dissolves if we stop guessing with a timer and instead watch the "…". Quick single message, no typing → fire in well under a second. Still typing → wait as long as they keep going. Everything in family A is a fallback for when typing data is unavailable or unreliable.

A · When to fire — timing & endpointing

All of these decide the single question: has the user stopped talking? They live in the poller's settle logic.

baseline

A1 · Fixed debounce (live)

What we run now. Reset an N-second timer on each row; fire on silence. Dead simple, robust, zero cost.

Cost: None.

Pain: Every message, even an obvious finished question, eats the full N seconds.

Lives in scheduleTurn / settleMs.
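As a minimal sketch of this fixed debounce, assuming a hypothetical `createSettler` helper rather than the actual scheduleTurn code: each chat keeps a row buffer, every new row restarts the timer, and silence fires one joined message. The injectable `setTimer` / `clearTimer` parameters are an illustration-only addition so the logic can be exercised without real waiting.

```javascript
// Hypothetical sketch of a fixed per-chat settle buffer (A1).
// `createSettler`, `push`, and the injected timer functions are illustrative
// names, not the actual chat-poller.mjs API.
function createSettler({ settleMs, onMessage, setTimer = setTimeout, clearTimer = clearTimeout }) {
  const buffers = new Map(); // chatId -> { rows: string[], timer }

  return {
    push(chatId, text) {
      let entry = buffers.get(chatId);
      if (!entry) {
        entry = { rows: [], timer: null };
        buffers.set(chatId, entry);
      }
      entry.rows.push(text);

      // Every new row restarts the quiet window.
      if (entry.timer !== null) clearTimer(entry.timer);
      entry.timer = setTimer(() => {
        buffers.delete(chatId);
        // Fire once for the whole burst, text joined by newlines.
        onMessage(chatId, entry.rows.join("\n"));
      }, settleMs);
    },
  };
}
```

With `settleMs: 8000` this reproduces the behavior described above: four quick bubbles produce one `onMessage`, but so does a single finished question, eight seconds late.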

strong

A2 · Adaptive debounce (content-aware window)

Same machinery, but the window length is a function of the last bubble. Complete-looking text → short window (e.g. 800ms). Mid-thought text → long window (e.g. 6–8s). Replace the constant settleMs with windowFor(rows).

Cost: None, pure heuristic.

Win: A quick "what should I eat?" fires almost instantly; a trailing "so i had…" still waits.

A completeness(text) heuristic feeding a variable timeout in scheduleTurn.
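One possible shape for that heuristic, as a sketch only. The `looksComplete` name, the `SHORT_ACKS` list, and the 800ms / 6000ms defaults are assumptions picked from the ranges mentioned above, not tuned values; `rows` is assumed to be an array of bubble strings.

```javascript
// Hypothetical completeness heuristic (A2). Names and thresholds are
// illustrative assumptions, not existing code.
const SHORT_ACKS = new Set(["ok", "k", "yes", "no", "yep", "nope", "thanks"]);

function looksComplete(text) {
  const t = text.trim().toLowerCase();
  if (t === "") return false;
  // Trailing off ("so i had...") reads as mid-thought despite the dots.
  if (t.endsWith("...") || t.endsWith("…")) return false;
  // Terminal punctuation suggests a finished thought.
  if (/[?.!]$/.test(t)) return true;
  // Bare endings count as mid-thought unless the whole bubble is a short ack.
  return SHORT_ACKS.has(t);
}

function windowFor(rows, { completeMs = 800, partialMs = 6000 } = {}) {
  const last = rows[rows.length - 1] ?? "";
  return looksComplete(last) ? completeMs : partialMs;
}
```

Only the last bubble is inspected, so a burst that ends on "of pizza" keeps the long window while "is that bad?" would close it quickly. Phrases like "wait" or "hold on" still fall through to the long window by default, which is the gap E1 is meant to cover.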

strong

A3 · Typing-indicator gating ("wait-to-talk") (recommended)

Subscribe to the typing indicator. Rule: never fire while the "…" is showing. When typing stops, start a short grace timer (~600–1200ms, covers the gap before the next bubble) and fire if it stays quiet. No typing seen at all → treat as done and fire on a short window.

Cost: None, just another event source.

Win: Latency collapses to near-zero for finished messages; long bursts are respected naturally.

Risk: Typing indicators aren't 100% reliable over all transports, so keep A1/A2 as a hard-cap fallback.

New typing feed in the poller → gates the settle timer.
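The gating rule can be written as a pure function of the chat's current state, which keeps it easy to test. This is a sketch under assumptions: the state fields (`lastRowAt`, `typing`, `typingStoppedAt`) and the function name are hypothetical, and timestamps are epoch milliseconds.

```javascript
// Hypothetical typing gate (A3). Returns how many ms to keep waiting:
// Infinity while the user is typing, 0 when it is time to fire.
function msUntilFire(state, { graceMs = 900, noTypingWindowMs = 800 } = {}) {
  const { now, lastRowAt, typing, typingStoppedAt = null } = state;

  // Never fire while the "…" is showing.
  if (typing) return Infinity;

  const sawTyping = typingStoppedAt !== null;
  // Quiet is measured from whichever happened last: a bubble or typing stopping.
  const quietSince = sawTyping ? Math.max(lastRowAt, typingStoppedAt) : lastRowAt;
  // Typing seen: short grace. No typing seen at all: short fallback window.
  const windowMs = sawTyping ? graceMs : noTypingWindowMs;

  return Math.max(0, quietSince + windowMs - now);
}
```

Because the result can be `Infinity`, a caller would still need a hard ceiling so a stuck indicator cannot hang the turn.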

strong

A4 · Two-tier timeout (soft + hard cap)

Two timers: a soft one that fires early when signals say "complete," and a hard ceiling (e.g. 10s) that fires no matter what, so we never hang on a flaky signal. A2/A3 ride on top of this.

Why: Safety net. Makes every adaptive scheme fail safe instead of fail silent.
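A small sketch of how the two timers might combine. Measuring the hard cap from the start of the burst is an assumption (the text only says it fires no matter what), and `nextFireTime` is a hypothetical name.

```javascript
// Hypothetical two-tier deadline (A4). `softDeadline` comes from whatever
// adaptive scheme is in use and may be Infinity (e.g. a typing indicator
// that never clears). The hard cap is measured from the first bubble of
// the burst, which is an assumption of this sketch.
function nextFireTime({ burstStartedAt, softDeadline }, hardCapMs = 10000) {
  const hardDeadline = burstStartedAt + hardCapMs;
  return Math.min(softDeadline, hardDeadline);
}
```

The `Math.min` is the whole safety property: no soft signal, however wrong, can push the fire time past the ceiling.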

nice-to-have

A5 · Per-user learned cadence

Track how a given handle texts — burst-of-many vs one-long-paragraph, typical inter-bubble gap — and tune their window over time. Stored per-user alongside the profile.

Win: Tailors latency to the actual human instead of a global guess.

Cost: State + tuning complexity; marginal over A2/A3. A later optimization, not a first move.

flavor

A6 · Time-of-day / engagement awareness

Tighten the window during an active back-and-forth (recent turns, fast replies); loosen it when the chat's been cold. A cheap multiplier on the base window.

B · Speculation — hide the wait

Different bet: don't wait then generate — generate during the wait and throw the draft away if the user keeps talking. Trades tokens for latency.

high-impact

B1 · Speculative generation during the window (recommended)

The moment a bubble lands, kick off the agent turn speculatively against the buffer-so-far. If another bubble arrives, abort and restart with the bigger context. When the window finally closes, the draft is usually already done → send instantly, zero perceived latency.

Win: Best-case latency ≈ 0 even with a conservative window, because the wait overlaps generation instead of preceding it.

Cost: Wasted tokens on every aborted draft. Bounded by burst length; cap restarts.

An AbortController on createMessage; restart on each new row; settle-fire just delivers the ready draft.
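A sketch of the abort-and-restart loop, using the standard `AbortController`. Here `generate(text, signal)` is a hypothetical stand-in for the real model call and is assumed to reject when its signal aborts; `createSpeculator` and its method names are likewise illustrative. The same abort path would serve cancel-and-restart during a turn.

```javascript
// Hypothetical speculative drafter (B1).
function createSpeculator(generate, { maxRestarts = 5 } = {}) {
  const rows = [];
  let controller = null; // AbortController for the in-flight draft, if any
  let draft = null;      // promise for the in-flight draft, if any
  let restarts = 0;

  function startDraft() {
    controller = new AbortController();
    draft = generate(rows.join("\n"), controller.signal);
    draft.catch(() => {}); // an aborted draft rejects; that is expected
  }

  return {
    // Call for every inbound bubble while the window is open.
    onRow(text) {
      rows.push(text);
      if (controller) {
        controller.abort(); // the old draft has not seen this bubble
        controller = null;
        draft = null;
        restarts += 1;
      }
      // Past the cap, stop speculating and wait for the window to close.
      if (restarts <= maxRestarts) startDraft();
    },
    // Call when the settle window closes; resolves to the reply text.
    onWindowClosed() {
      if (!draft) startDraft();
      return draft;
    },
  };
}
```

If the last draft finished while the window was still open, `onWindowClosed()` resolves immediately, which is where the perceived latency disappears.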

partial

B2 · Speculative + commit-on-quiet

Variant: let the speculative draft run to completion, but hold delivery behind the quiet check. If new input arrived during generation, discard and re-run (this is B1 combined with the C2 pre-send recheck). If not, ship it. The "look before you speak" gate.

expensive

B3 · Always-on draft (eager)

Regenerate a fresh draft on every row, always have a current answer ready. Lowest latency, highest waste. Only sane if turns are cheap or the model is tiny. Listed for completeness.

C · Barge-in — input that arrives during a turn

The half a debounce can't touch. The user spoke while we were thinking, or while bubbles were still landing on their phone. What do we do with the half-built reply?

core

C1 · Cancel-and-restart (recommended)

New input mid-generation → abort the in-flight LLM call, discard, restart with the new message folded in. Cleanest correctness: whatever we say will have seen everything. Pairs naturally with B1 (same abort path).

Cost: Tokens already spent on the killed call.

AbortController threaded through createMessage in runAgent.

core

C2 · Queue-and-recheck before send

Let the turn finish, but in deliverBubbles check the inbound buffer before each bubble. If new input landed, stop, discard remaining bubbles, and re-run with the new message + what was already sent. The delivery loop already pauses between bubbles — natural interrupt points.

Win: No wasted in-flight call; just unsent bubbles dropped. Cheaper than C1.

Risk: If we already sent bubble 1 of 3, the re-run must own that: feed it "you already said X."

A buffer-peek between iterations of the deliverBubbles loop.
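A sketch of that buffer-peek, written as a standalone function for illustration. The real deliverBubbles signature is not shown in this document, so `send`, `hasNewInput`, and `pause` are hypothetical injected dependencies.

```javascript
// Hypothetical delivery loop with a pre-send recheck (C2).
async function deliverBubbles(bubbles, { send, hasNewInput, pause = async () => {} }) {
  const sent = [];
  for (const bubble of bubbles) {
    // Natural interrupt point: peek at the inbound buffer before each bubble.
    if (hasNewInput()) {
      return { interrupted: true, sent, dropped: bubbles.slice(sent.length) };
    }
    await send(bubble);
    sent.push(bubble);
    await pause(bubble); // typing cadence between bubbles
  }
  return { interrupted: false, sent, dropped: [] };
}
```

Returning `sent` alongside `dropped` gives the re-run what it needs to be told which bubbles already went out.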

advanced

C3 · Mid-stream injection (true barge-in)

The "pass the new message in while it's generating" idea. With the standard request/response API you can't literally inject into a running call — "injection" reduces to C1 (abort + re-issue with the message appended). Real mid-stream injection needs a streaming/realtime session that accepts new context on an open turn. Powerful, heaviest lift; revisit if we move to a realtime transport.

Reality: Today = C1 in a trench coat. True version = new infrastructure.

core

C4 · Epoch / supersede token

Stamp every turn with a monotonic epoch per chat. New input bumps the epoch. Any turn that finishes with a stale epoch discards its output instead of sending. The guardrail that makes C1/C2 safe under races — guarantees the newest read always wins and stale replies die quietly.

Why: Without it, a slow turn and a fast restart can both reach helper.send → double reply.
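A minimal sketch of the epoch check, assuming hypothetical `createEpochGuard` and `runGuardedTurn` helpers; `think` and `send` stand in for the model call and helper.send.

```javascript
// Hypothetical per-chat epoch guard (C4).
function createEpochGuard() {
  const epochs = new Map(); // chatId -> current epoch

  return {
    // Call on every new inbound message; returns the epoch a new turn should carry.
    bump(chatId) {
      const next = (epochs.get(chatId) ?? 0) + 1;
      epochs.set(chatId, next);
      return next;
    },
    isCurrent(chatId, epoch) {
      return epochs.get(chatId) === epoch;
    },
  };
}

// Runs one turn and sends only if no newer input superseded it meanwhile.
async function runGuardedTurn(guard, chatId, epoch, think, send) {
  const reply = await think();
  if (!guard.isCurrent(chatId, epoch)) return false; // stale: die quietly
  await send(reply);
  return true;
}
```

The check sits after generation and immediately before sending, so a slow turn that finishes late sees the bumped epoch and drops its output.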

D · Concurrency & queueing

Structural guarantees so "one reader" is enforced by the architecture, not by luck.

D1 · Per-chat single-flight (live). The inflight map already serializes turns per chat. Keep it; it's the floor. Nothing else works without it.
D2 · Coalescing queue. While a turn runs, accumulate inbound into a buffer. When it ends, if the buffer's non-empty, immediately run once over the whole buffer, never one-turn-per-queued-message. Turns the inflight wait into a re-batch instead of a backlog of solo replies.
D3 · Drop-and-replace (newest wins). Newer input supersedes any older queued/in-flight turn for that chat. Combined with D2: don't drain the queue, collapse it.
D4 · Single consumer loop per chat. Refactor: one async worker per chat pulling from an inbound queue, owning the COLLECT → THINK → SPEAK state machine. Centralizes all timing/serialization that's currently spread across timers + the inflight chain. Cleaner home for everything in A–C.
D5 · Event-sourced "high-water mark". Track the last row the agent has actually responded to. The agent always answers "everything since the mark." Makes "did we already cover this bubble?" a lookup, not a guess, and makes restarts/barge-in idempotent.
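Single-flight plus coalescing (D1 and D2) can be sketched as one small per-chat worker. `createChatWorker` and `runTurn` are hypothetical names; `runTurn(batch)` stands in for a full agent turn over an array of bubbles, and error handling is deliberately left to the caller.

```javascript
// Hypothetical per-chat worker: at most one turn in flight, and everything
// that queued up during a turn is handled as ONE follow-up batch.
function createChatWorker(runTurn) {
  const pending = [];
  let running = false;
  let idle = Promise.resolve(); // settles when the current drain finishes

  async function drain() {
    running = true;
    try {
      while (pending.length > 0) {
        // Take everything queued so far as a single batch.
        const batch = pending.splice(0, pending.length);
        await runTurn(batch);
      }
    } finally {
      running = false;
    }
  }

  return {
    enqueue(text) {
      pending.push(text);
      if (!running) idle = drain();
      return idle; // rejects if runTurn throws; callers should handle that
    },
  };
}
```

Three bubbles that arrive during a running turn become one second turn over all three, not three solo replies.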

E · Let the model decide

Push the endpointing judgment into intelligence instead of heuristics. The model literally reads the chat and decides whether it's its turn.

strong

E1 · Cheap "is the user done?" classifier

Before the expensive coach turn, a fast small-model (Haiku) call: given the last few bubbles + typing state, return {done, suggestedWaitMs}. Separates the cheap timing decision from the expensive content decision. Smarter than regex, ~nothing in latency/cost.

Win: Catches "hold on" / "actually wait" that punctuation rules miss.

A pre-pass in handleInbound ahead of runAgent.
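A sketch of that pre-pass with the model call injected, so nothing here depends on a specific SDK. `isUserDone`, `callSmallModel`, and the prompt wording are all hypothetical; the point is the defensive parsing, so a malformed classifier answer degrades to a conservative wait instead of a crash.

```javascript
// Hypothetical endpointing classifier wrapper (E1).
// `callSmallModel(prompt)` is assumed to resolve to the model's raw text.
async function isUserDone(bubbles, typing, callSmallModel, fallbackWaitMs = 6000) {
  const prompt = [
    "Decide whether the user has finished their thought.",
    'Answer with JSON only: {"done": boolean, "suggestedWaitMs": number}.',
    `User is currently typing: ${typing}`,
    "Recent bubbles:",
    ...bubbles.map((b) => `- ${b}`),
  ].join("\n");

  try {
    const parsed = JSON.parse(await callSmallModel(prompt));
    if (typeof parsed.done !== "boolean") throw new Error("unexpected shape");
    const wait = Number.isFinite(parsed.suggestedWaitMs)
      ? Math.max(0, parsed.suggestedWaitMs)
      : fallbackWaitMs;
    return { done: parsed.done, suggestedWaitMs: parsed.done ? 0 : wait };
  } catch {
    // Any failure (network, bad JSON, wrong shape) falls back to waiting.
    return { done: false, suggestedWaitMs: fallbackWaitMs };
  }
}
```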

elegant

E2 · A `wait_for_more` tool

Give the coach a tool it can call instead of replying: "they seem mid-thought, hold ~Ns." The model reads the thread and chooses respond-now vs wait. Endpointing becomes part of the agent's reasoning, not a separate system.

Cost: Spends a full turn to decide to wait, so pricier than E1. Best when the coach's judgment about this user matters.

New tool in TOOLS / runTool; turn yields without sending.
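What such a tool might look like, as a sketch. The definition uses a JSON-Schema style shape with an `input_schema` field; whether TOOLS expects exactly this shape is an assumption, as are the `seconds` / `reason` parameters, the handler name, and its return value.

```javascript
// Hypothetical tool definition for E2.
const waitForMoreTool = {
  name: "wait_for_more",
  description:
    "Call instead of replying when the user seems mid-thought. Holds the turn briefly, then re-reads the thread.",
  input_schema: {
    type: "object",
    properties: {
      seconds: { type: "number", description: "How long to hold before re-reading." },
      reason: { type: "string", description: "Why the user seems unfinished." },
    },
    required: ["seconds"],
  },
};

// Hypothetical handler: clamps the model's request and tells the caller to
// yield without sending anything.
function handleWaitForMore(input, { maxSeconds = 10 } = {}) {
  const requested = Number(input?.seconds);
  const seconds = Number.isFinite(requested)
    ? Math.min(Math.max(requested, 0), maxSeconds)
    : maxSeconds;
  return { action: "wait", waitMs: Math.round(seconds * 1000), sendReply: false };
}
```

Clamping matters here because the wait length is model-chosen; without a ceiling, one odd tool call could stall the chat.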

option

E3 · `stay_silent` / no-op reply

Let the turn legitimately produce nothing — the coach decides this fragment doesn't warrant a reply yet and yields. Requires the loop to treat empty output as intentional, not as the "no reply produced" error it logs today.

F · Context shaping — so it reads as one voice

Even with perfect timing, how we lay the burst into the prompt decides whether the reply feels like one person.

F1 · Merge consecutive user bubbles. When building messages in runAgent, collapse adjacent user turns into one user message. The model sees "what they said," not a stuttered transcript it might answer line-by-line.
F2 · "Unanswered since" framing. Annotate: "The user sent these 3 messages since your last reply. Respond to all of them, together, once." Makes the batch explicit and kills the urge to address only the latest.
F3 · Surface what we already said. On a barge-in re-run (C2), tell the model which bubbles already went out so it continues rather than repeats.
F4 · Keep output coherence. Lean on the existing multi-bubble split (toBubbles): one turn can still look like 3 natural texts, so batching never costs us the texty feel.
F5 · Dedup guard at the send boundary. Last-line defense: if two turns somehow both reach helper.send, suppress the duplicate (pairs with the C4 epoch).
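F1 is small enough to sketch directly. This assumes history entries are `{ role, content }` objects with string content, which may not match the real user.history shape; `mergeUserBubbles` is a hypothetical name.

```javascript
// Hypothetical F1 helper: collapse adjacent user messages into one.
// Returns a new array and leaves the input untouched.
function mergeUserBubbles(history) {
  const merged = [];
  for (const msg of history) {
    const prev = merged[merged.length - 1];
    if (prev && prev.role === "user" && msg.role === "user") {
      prev.content += "\n" + msg.content;
    } else {
      merged.push({ ...msg }); // copy so the caller's objects are not mutated
    }
  }
  return merged;
}
```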

The turn-taking state machine

Most ideas above are transitions in one small machine, run per chat (D4). Naming the states makes "what happens when a message arrives right now?" answerable in every case:

State	Meaning	New inbound message →
IDLE	nothing pending	→ COLLECTING; start the adaptive/typing-gated window (A2/A3)
COLLECTING	buffering a burst	append; reset window. Optionally kick speculative draft (B1)
THINKING	LLM generating	abort + restart with new context (C1), bump epoch (C4)
SPEAKING	delivering bubbles	finish current bubble, then re-collect & re-run (C2)

Window-elapsed moves COLLECTING → THINKING. Draft-ready + quiet moves THINKING → SPEAKING. Last bubble sent moves SPEAKING → IDLE. The high-water mark (D5) advances on every send.
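The table and the transitions above can be captured as a lookup, sketched below. The event names are hypothetical labels for the triggers described in the text, and the side effects (abort, epoch bump, buffering) are left out so only the state changes are shown.

```javascript
// Hypothetical transition table for the per-chat state machine.
const TRANSITIONS = {
  IDLE: { inbound: "COLLECTING" },
  COLLECTING: { inbound: "COLLECTING", windowElapsed: "THINKING" },
  // Inbound while thinking aborts and restarts, so the state stays THINKING.
  THINKING: { inbound: "THINKING", draftReadyAndQuiet: "SPEAKING" },
  // Inbound while speaking finishes the current bubble, then re-collects.
  SPEAKING: { inbound: "COLLECTING", lastBubbleSent: "IDLE" },
};

function nextState(state, event) {
  const next = TRANSITIONS[state]?.[event];
  if (!next) throw new Error(`no transition from ${state} on ${event}`);
  return next;
}
```

Throwing on an unknown pair makes a missed case loud during development instead of silently leaving a chat stuck in one state.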

Recommended stack

Layered cheapest-first; each layer stands alone and composes with the next.

1. Adaptive window (A2) + two-tier cap (A4), do first. Pure heuristic, zero cost, no new infra. Kills most of the "quick message stuck behind 8s" pain immediately. Highest impact per line of code.
2. Typing-indicator gating (A3). The real fix. Fire on "typing stopped + brief grace," fall back to the A2 window when no typing seen. This is where the delay genuinely disappears.
3. Epoch + single-flight + coalescing queue (C4 / D1 / D2). Make the architecture guarantee one-reader-wins before adding speculation. Without the epoch, the next layer can double-text.
4. Speculative generation (B1) + cancel-restart (C1). Same abort path. Overlap generation with the window so even a conservative window feels instant; barge-in just restarts.
5. Pre-send recheck (C2) + context shaping (F1/F2). Handle the user who speaks mid-reply, and make every batched turn read as one voice.
6. Cheap classifier (E1), if needed. Add only if heuristics + typing still mis-call endpoints. Catches the linguistic "wait" cases regex can't.

The short version. Watch the typing indicator, vary the window by message completeness, and generate speculatively so the wait overlaps thinking — guarded by a per-chat epoch so only the newest read ever speaks. Steps 1–2 alone retire the complaint; 3–4 make it feel instant.

Config knobs

What becomes tunable as these land (extends today's single SETTLE_MS):

Knob	Controls	Rough start
`WINDOW_COMPLETE_MS`	window when text looks finished (A2)	700–1000ms
`WINDOW_PARTIAL_MS`	window when text looks mid-thought (A2)	5–8s
`WINDOW_HARD_CAP_MS`	fire-no-matter-what ceiling (A4)	10s
`TYPING_GRACE_MS`	quiet needed after "…" stops (A3)	600–1200ms
`SPECULATE`	generate during the window (B1)	on
`MAX_SPEC_RESTARTS`	abort/restart cap per burst (B1/C1)	4–6
`BARGEIN_RECHECK`	peek buffer between bubbles (C2)	on
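A sketch of loading these knobs, assuming they arrive as environment-style string values the way SETTLE_MS appears to. `loadConfig` is a hypothetical name, and the defaults are single values picked from inside the "rough start" ranges above, not recommendations.

```javascript
// Hypothetical config loader. Pass process.env (or any string map).
function loadConfig(env = {}) {
  // Non-negative integer knob; anything unparsable falls back to the default.
  const int = (name, fallback) => {
    const n = Number.parseInt(env[name] ?? "", 10);
    return Number.isFinite(n) && n >= 0 ? n : fallback;
  };
  // on/off knob; unknown spellings fall back to the default.
  const flag = (name, fallback) => {
    const v = String(env[name] ?? "").toLowerCase();
    if (v === "on" || v === "true" || v === "1") return true;
    if (v === "off" || v === "false" || v === "0") return false;
    return fallback;
  };

  return {
    windowCompleteMs: int("WINDOW_COMPLETE_MS", 800),
    windowPartialMs: int("WINDOW_PARTIAL_MS", 6000),
    windowHardCapMs: int("WINDOW_HARD_CAP_MS", 10000),
    typingGraceMs: int("TYPING_GRACE_MS", 900),
    speculate: flag("SPECULATE", true),
    maxSpecRestarts: int("MAX_SPEC_RESTARTS", 5),
    bargeInRecheck: flag("BARGEIN_RECHECK", true),
  };
}
```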

Living doc. Families A–F are orthogonal — mix per the latency/coherence/cost you're short on. See imessage/chat-poller.mjs (timing, queueing) and src/agent.mjs (generation, barge-in, delivery) for where each lands.
