Morpheus — Reading the Chat

One human is texting. There should be one coach reading the thread and replying once, not N parallel agents racing each other. This doc collects every idea we have for getting there.

2026-06-18 · companion to imessage/chat-poller.mjs + src/agent.mjs · the message-batching / turn-taking problem

The problem, precisely

iMessage is bursty. A human rarely sends one tidy paragraph — they fire a stutter of bubbles as the thought assembles:

them: wait
them: so i just had like 3 slices
them: of pizza
them: is that bad lol

The naive transport — one inbound row → one agent turn → one reply — turns that into four overlapping turns. Four LLM calls, four replies, each one blind to the bubbles that came after it. The coach answers "wait" before "pizza" has even arrived. That's the multi-thread failure: the user sees a stranger answering each fragment, not one person reading the message.

We want the opposite: collapse the burst into one coherent read, and reply once — but without making a person who sends a single quick "what should I eat?" sit through a fixed dead-air delay before the coach reacts. Those two goals pull against each other. That tension is the whole design space below.

Two distinct sub-problems, often conflated. (1) Coalescing — bubbles that arrive before we start generating. (2) Barge-in — bubbles that arrive while we're generating or already mid-reply. A debounce window only solves (1). A complete solution needs both.

What we have today

The one knob in production is the fixed 8s settle. It's the "annoying delay." Everything below is about beating a fixed delay.

The three axes

Every idea is a choice on three axes. Naming them keeps the catalog honest:

Axis | Question | Cheap end ↔ expensive end
Latency | How long until the coach reacts to a finished thought? | instant ↔ fixed dead-air
Coherence | Does one reply address the whole burst? | per-fragment ↔ whole-thread
Cost | Tokens / compute burned per human turn? | one call ↔ speculative re-runs

A fixed debounce trades latency for coherence at zero extra cost. Speculation buys latency back by spending cost. Model-in-the-loop buys a more accurate "are they done?" call by spending one small extra model call. Choose based on whichever axis you are short on.

Signals we can read

Smart timing needs evidence that the user is (or isn't) done. What's actually available to us:

Signal | Source | Tells us
Typing indicator | chat.db / helper stream (the "…" bubble) | The strongest signal. If they're typing, they're not done. Fire the instant typing stops, not on a blind timer.
Terminal punctuation | message text | Ends with ? . ! → likely complete. Ends bare, or on and / so / i / , → likely mid-thought.
Inter-bubble gap | row timestamps | Tight gaps (<2s) = a burst in progress. A long gap = thought finished.
Message shape | text length / form | A lone "?" or "ok" is complete-and-short → fire fast. A trailing fragment wants a wait.
Question vs statement | content | A direct question expects an immediate answer; rambling context can tolerate a beat.
Attachment rows | has_attachments | A photo often lands as its own row, with the caption a separate bubble. Almost always wait for the pair.
Read receipts / our own state | helper | Whether we've already marked read / started typing back; affects whether interrupting is graceful.

The typing indicator is the unlock. Most of the "annoying delay" problem dissolves if we stop guessing with a timer and instead watch the "…". Quick single message, no typing → fire in well under a second. Still typing → wait as long as they keep going. Everything else in family A (A1, A2, A4–A6) is a fallback for when typing data is unavailable or unreliable; A3 is the option that consumes the signal directly.

A · When to fire — timing & endpointing

All of these decide the single question: has the user stopped talking? They live in the poller's settle logic.

A1 · Fixed debounce [baseline, live]

What we run now. Reset an N-second timer on each row; fire on silence. Dead simple, robust, zero cost.

Cost: None.
Pain: Every message, even an obviously finished question, eats the full N seconds.
Lives in scheduleTurn / settleMs.
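
The reset-on-each-row behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in chat-poller.mjs: `createSettler`, `onFire`, and the injectable `setTimer` / `clearTimer` parameters are hypothetical names introduced here so the logic can be exercised without real timers; only `scheduleTurn` and `settleMs` come from the doc.

```javascript
// Fixed-debounce sketch. createSettler / onFire / setTimer / clearTimer are
// hypothetical names for illustration; the real poller may be shaped differently.
function createSettler({ settleMs, onFire, setTimer = setTimeout, clearTimer = clearTimeout }) {
  let buffer = [];   // rows collected since the last fire
  let timer = null;  // handle of the pending settle timer, if any

  return {
    // Call once per inbound row: buffer it and restart the countdown.
    scheduleTurn(row) {
      buffer.push(row);
      if (timer !== null) clearTimer(timer);
      timer = setTimer(() => {
        const burst = buffer;
        buffer = [];
        timer = null;
        onFire(burst); // one agent turn for the whole burst
      }, settleMs);
    },
  };
}
```

With `settleMs` at 8000 this reproduces the pain point named above: a lone, obviously finished question still waits the full eight seconds before `onFire` runs.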

A2 · Adaptive debounce (content-aware window) [strong]

Same machinery, but the window length is a function of the last bubble. Complete-looking text → short window (e.g. 800ms). Mid-thought text → long window (e.g. 6–8s). Replace the constant settleMs with windowFor(rows).

Cost: None; pure heuristic.
Win: Quick "what should I eat?" fires almost instantly; trailing "so i had…" still waits.
A completeness(text) heuristic feeding a variable timeout in scheduleTurn.
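
One possible shape for that heuristic, as a sketch only. The `completeness` and `windowFor` names come from the text above; the row shape (`{ text }`), the two return labels, the short-reply word list, and the default window lengths are assumptions picked for illustration.

```javascript
// Assumed list of lone replies that count as complete even without punctuation.
const SHORT_COMPLETE = new Set(["ok", "k", "yes", "no", "thanks", "lol"]);

// Classify the last bubble as "complete" or "partial" (labels are hypothetical).
function completeness(text) {
  const t = String(text || "").trim().toLowerCase();
  if (t === "") return "partial";
  if (t.endsWith("...") || t.endsWith("…")) return "partial"; // trailing off
  if (/[?.!]$/.test(t)) return "complete";                    // terminal punctuation
  if (SHORT_COMPLETE.has(t)) return "complete";               // lone short reply
  return "partial"; // bare ending, comma, or a trailing "and" / "so" / "i"
}

// Pick the settle window from the most recent row. Assumes rows look like { text }.
function windowFor(rows, { completeMs = 800, partialMs = 6000 } = {}) {
  const last = rows[rows.length - 1];
  return completeness(last && last.text) === "complete" ? completeMs : partialMs;
}
```

A regex pass like this will misjudge some messages (for example a finished statement typed without punctuation), which is the gap the model-based ideas in family E are meant to cover.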

A3 · Typing-indicator gating ("wait-to-talk") [strong, recommended]

Subscribe to the typing indicator. Rule: never fire while the "…" is showing. When typing stops, start a short grace timer (~600–1200ms, covers the gap before the next bubble) and fire if it stays quiet. No typing seen at all → treat as done and fire on a short window.

Cost: None; just another event source.
Win: Latency collapses to near-zero for finished messages; long bursts are respected naturally.
Risk: Typing indicators aren't 100% reliable over all transports; keep A1/A2 as a hard-cap fallback.
New typing feed in the poller → gates the settle timer.
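
The gating rule can be sketched as a small object that the poller feeds with events and polls with the current time. Everything named here (`createTypingGate`, `onRow`, `onTypingStart`, `onTypingStop`, `shouldFire`) is hypothetical, and the sketch assumes the typing feed emits explicit start and stop events; the hard cap from A4 is deliberately left out.

```javascript
// Typing-gate sketch. All names are hypothetical; timestamps are in ms.
function createTypingGate({ graceMs = 900, noTypingWindowMs = 800 } = {}) {
  let typing = false;      // is the "…" currently showing?
  let sawTyping = false;   // did we see any typing during this burst?
  let bufferedRows = 0;    // rows waiting for a turn
  let lastActivityAt = 0;  // time of the last row or typing event

  return {
    onRow(now) { bufferedRows += 1; lastActivityAt = now; },
    onTypingStart(now) { typing = true; sawTyping = true; lastActivityAt = now; },
    onTypingStop(now) { typing = false; lastActivityAt = now; },

    // True when the buffered burst should be handed to the agent.
    shouldFire(now) {
      if (bufferedRows === 0) return false; // nothing to answer
      if (typing) return false;             // never fire while the "…" is showing
      const quietNeeded = sawTyping ? graceMs : noTypingWindowMs;
      return now - lastActivityAt >= quietNeeded;
    },

    // Call after the turn has been handed off.
    reset() { typing = false; sawTyping = false; bufferedRows = 0; },
  };
}
```

If a stop event is ever dropped, `shouldFire` stays false indefinitely, which is why a ceiling timer on top of this gate matters.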

A4 · Two-tier timeout (soft + hard cap) [strong]

Two timers: a soft one that fires early when signals say "complete," and a hard ceiling (e.g. 10s) that fires no matter what, so we never hang on a flaky signal. A2/A3 ride on top of this.

Why: Safety net. Makes every adaptive scheme fail safe instead of fail silent.
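
As a rough sketch, the two timers reduce to taking the earlier of two deadlines. `fireDeadline` is a hypothetical helper, and measuring the hard cap from the first bubble of the burst is an assumption; the doc only says the ceiling fires no matter what.

```javascript
// Two-tier deadline sketch (hypothetical helper; all times in ms).
// soft: restarts on every row, length chosen by whatever adaptive scheme is in use.
// hard: assumed to be measured from the first row of the burst.
function fireDeadline({ firstRowAt, lastRowAt, softWindowMs, hardCapMs = 10000 }) {
  const softDeadline = lastRowAt + softWindowMs;
  const hardDeadline = firstRowAt + hardCapMs;
  return Math.min(softDeadline, hardDeadline);
}
```

For a single complete-looking message the soft deadline wins; for a user who keeps dribbling bubbles every few seconds, the hard deadline eventually forces a turn.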

A5 · Per-user learned cadence [nice-to-have]

Track how a given handle texts — burst-of-many vs one-long-paragraph, typical inter-bubble gap — and tune their window over time. Stored per-user alongside the profile.

Win: Tailors latency to the actual human instead of a global guess.
Cost: State + tuning complexity; marginal over A2/A3. A later optimization, not a first move.
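
One simple way to learn a cadence is an exponentially weighted moving average of the inter-bubble gap, stored on the profile. This is a sketch of that idea only: the field names (`avgGapMs`, `samples`), the smoothing factor, the "twice the typical gap" rule, and the sample threshold are all assumptions, not anything the doc specifies.

```javascript
// Fold one observed inter-bubble gap into a user's profile (EWMA).
// Field names and alpha are hypothetical.
function updateCadence(profile, gapMs, alpha = 0.2) {
  const prev = profile.avgGapMs;
  const avgGapMs = prev == null ? gapMs : alpha * gapMs + (1 - alpha) * prev;
  return { ...profile, avgGapMs, samples: (profile.samples || 0) + 1 };
}

// Derive a settle window from the learned gap, clamped to sane bounds.
// Falls back to a global default until enough samples exist.
function windowForUser(profile, { defaultMs = 6000, minMs = 800, maxMs = 8000, minSamples = 20 } = {}) {
  if ((profile.samples || 0) < minSamples) return defaultMs;
  return Math.min(maxMs, Math.max(minMs, Math.round(profile.avgGapMs * 2)));
}
```

The clamp keeps a noisy profile from producing a window shorter than the A2 "complete" window or longer than the current fixed settle.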

A6 · Time-of-day / engagement awareness [flavor]

Tighten the window during an active back-and-forth (recent turns, fast replies); loosen it when the chat's been cold. A cheap multiplier on the base window.

B · Speculation — hide the wait

Different bet: don't wait then generate — generate during the wait and throw the draft away if the user keeps talking. Trades tokens for latency.

B1 · Speculative generation during the window [high-impact, recommended]

The moment a bubble lands, kick off the agent turn speculatively against the buffer-so-far. If another bubble arrives, abort and restart with the bigger context. When the window finally closes, the draft is usually already done → send instantly, zero perceived latency.

Win: Best-case latency ≈ 0 even with a conservative window: the wait overlaps generation instead of preceding it.
Cost: Wasted tokens on every aborted draft. Bounded by burst length; cap restarts.
An AbortController on createMessage; restart on each new row; settle-fire just delivers the ready draft.
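
A sketch of that abort-and-restart loop, under stated assumptions: `createSpeculator`, `onRow`, and `settle` are hypothetical names, and `generate(rows, signal)` stands in for whatever wraps createMessage, assumed to return a promise and to honour the standard `AbortSignal` it is given.

```javascript
// Speculative drafting sketch. generate(rows, signal) is a stand-in for the
// real agent call; it is assumed to return a promise and respect the signal.
function createSpeculator({ generate, maxRestarts = 5 }) {
  let latest = [];        // buffer-so-far as of the most recent row
  let controller = null;  // AbortController for the in-flight draft, if any
  let draft = null;       // promise for the in-flight (or finished) draft
  let restarts = 0;

  function discardDraft() {
    if (!controller) return;
    draft.catch(() => {}); // an aborted call is expected to reject; ignore it
    controller.abort();
    controller = null;
    draft = null;
    restarts += 1;
  }

  return {
    // Call on every inbound row with the whole buffer so far.
    onRow(buffer) {
      latest = [...buffer];
      discardDraft();
      if (restarts > maxRestarts) return; // cap hit: stop speculating for this burst
      controller = new AbortController();
      draft = Promise.resolve(generate(latest, controller.signal));
    },

    // Call when the window closes: reuse the draft, or generate once if none exists.
    settle() {
      const result = draft || Promise.resolve(generate(latest, new AbortController().signal));
      controller = null;
      draft = null;
      restarts = 0;
      latest = [];
      return result;
    },
  };
}
```

Once the restart cap is hit the sketch stops drafting and falls back to one ordinary generation at settle time, so a long burst costs at most `maxRestarts + 2` calls.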

B2 · Speculative + commit-on-quiet [partial]

Variant: let the speculative draft run to completion, but hold delivery behind the quiet check. If new input arrived during generation, discard and re-run (this is B1 ∪ barge-in C2). If not, ship it. The "look before you speak" gate.

B3 · Always-on draft (eager) [expensive]

Regenerate a fresh draft on every row, always have a current answer ready. Lowest latency, highest waste. Only sane if turns are cheap or the model is tiny. Listed for completeness.

C · Barge-in — input that arrives during a turn

The half a debounce can't touch. The user spoke while we were thinking, or while bubbles were still landing on their phone. What do we do with the half-built reply?

C1 · Cancel-and-restart [core, recommended]

New input mid-generation → abort the in-flight LLM call, discard, restart with the new message folded in. Cleanest correctness: whatever we say will have seen everything. Pairs naturally with B1 (same abort path).

Cost: Tokens already spent on the killed call.
AbortController threaded through createMessage in runAgent.

C2 · Queue-and-recheck before send [core]

Let the turn finish, but in deliverBubbles check the inbound buffer before each bubble. If new input landed, stop, discard remaining bubbles, and re-run with the new message + what was already sent. The delivery loop already pauses between bubbles — natural interrupt points.

Win: No wasted in-flight call; just unsent bubbles dropped. Cheaper than C1.
Risk: If we already sent bubble 1 of 3, the re-run must own that: feed it "you already said X."
A buffer-peek between iterations of the deliverBubbles loop.
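
The buffer-peek can be sketched as a check at the top of each loop iteration. This version is synchronous for brevity and takes `send` and `hasNewInput` as injected callbacks, both hypothetical; the real deliverBubbles loop pauses between bubbles and its send path is likely asynchronous.

```javascript
// Recheck-before-send sketch. send / hasNewInput are hypothetical callbacks.
// Returns what was sent and what was dropped so the re-run can be told
// "you already said X".
function deliverBubbles(bubbles, { send, hasNewInput }) {
  const sent = [];
  for (const bubble of bubbles) {
    if (hasNewInput()) {
      // New inbound rows landed: stop here and hand back the unsent remainder.
      return { interrupted: true, sent, dropped: bubbles.slice(sent.length) };
    }
    send(bubble);
    sent.push(bubble);
  }
  return { interrupted: false, sent, dropped: [] };
}
```

Returning `sent` alongside `dropped` gives the re-run the context it needs to avoid repeating or contradicting the bubbles that already went out.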

C3 · Mid-stream injection (true barge-in) [advanced]

The "pass the new message in while it's generating" idea. With the standard request/response API you can't literally inject into a running call — "injection" reduces to C1 (abort + re-issue with the message appended). Real mid-stream injection needs a streaming/realtime session that accepts new context on an open turn. Powerful, heaviest lift; revisit if we move to a realtime transport.

Reality: Today = C1 in a trench coat. True version = new infrastructure.

C4 · Epoch / supersede token [core]

Stamp every turn with a monotonic epoch per chat. New input bumps the epoch. Any turn that finishes with a stale epoch discards its output instead of sending. The guardrail that makes C1/C2 safe under races — guarantees the newest read always wins and stale replies die quietly.

Why: Without it, a slow turn and a fast restart can both reach helper.send → double reply.
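
A minimal sketch of the epoch guard, assuming a single process holding per-chat state in memory. `createEpochGate`, `bump`, and `isCurrent` are hypothetical names.

```javascript
// Per-chat monotonic epoch sketch (names are hypothetical).
function createEpochGate() {
  const epochs = new Map(); // chatId -> latest epoch issued

  return {
    // Call on every new inbound message; the returned value stamps the turn.
    bump(chatId) {
      const next = (epochs.get(chatId) || 0) + 1;
      epochs.set(chatId, next);
      return next;
    },
    // Call right before sending; a stale turn must discard its output.
    isCurrent(chatId, epoch) {
      return epochs.get(chatId) === epoch;
    },
  };
}
```

A turn would capture `const mine = gate.bump(chatId)` when it starts and check `gate.isCurrent(chatId, mine)` immediately before helper.send, dropping the reply silently when the check fails.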

D · Concurrency & queueing

Structural guarantees so "one reader" is enforced by the architecture, not by luck.

E · Let the model decide

Push the endpointing judgment into intelligence instead of heuristics. The model literally reads the chat and decides whether it's its turn.

E1 · Cheap "is the user done?" classifier [strong]

Before the expensive coach turn, make a fast small-model (Haiku) call: given the last few bubbles + typing state, return {done, suggestedWaitMs}. This separates the cheap timing decision from the expensive content decision. It is smarter than regex and adds little latency or cost relative to the coach turn.

Win: Catches "hold on—" / "actually wait" that punctuation rules miss.
A pre-pass in handleInbound ahead of runAgent.
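
Whatever model call produces the verdict, its output should be validated before it steers timing. The sketch below covers only that validation step and assumes the classifier replies with a JSON string; `parseEndpointVerdict` and its fallback values are hypothetical. On any malformed reply it defaults to "not done, wait the fallback", so a bad classification degrades to a plain debounce rather than a premature reply.

```javascript
// Validate the classifier's {done, suggestedWaitMs} reply (hypothetical helper).
// raw is assumed to be the model's text output, expected to be JSON.
function parseEndpointVerdict(raw, { fallbackWaitMs = 6000, maxWaitMs = 10000 } = {}) {
  try {
    const verdict = JSON.parse(raw);
    if (typeof verdict.done !== "boolean") throw new Error("missing done flag");
    if (verdict.done) return { done: true, suggestedWaitMs: 0 };
    const wait = Number.isFinite(verdict.suggestedWaitMs) ? verdict.suggestedWaitMs : fallbackWaitMs;
    return { done: false, suggestedWaitMs: Math.min(maxWaitMs, Math.max(0, wait)) };
  } catch {
    // Unparseable or malformed: behave like a conservative debounce.
    return { done: false, suggestedWaitMs: fallbackWaitMs };
  }
}
```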

E2 · A wait_for_more tool [elegant]

Give the coach a tool it can call instead of replying: "they seem mid-thought, hold ~Ns." The model reads the thread and chooses respond-now vs wait. Endpointing becomes part of the agent's reasoning, not a separate system.

Cost: Spends a full turn to decide to wait, so it is pricier than E1. Best when the coach's judgment about this user matters.
New tool in TOOLS / runTool; turn yields without sending.
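
A sketch of what the tool entry and its handler might look like. It assumes TOOLS holds definitions in the `name` / `description` / `input_schema` shape used by the Anthropic Messages API, which the doc does not confirm; the `seconds` parameter, the `runWaitForMore` handler, and the `yieldTurn` result field are all hypothetical.

```javascript
// Hypothetical tool definition; the schema shape is an assumption about TOOLS.
const WAIT_FOR_MORE_TOOL = {
  name: "wait_for_more",
  description:
    "Call this instead of replying when the user seems mid-thought. " +
    "The turn ends without sending anything and the thread is re-read later.",
  input_schema: {
    type: "object",
    properties: {
      seconds: { type: "number", description: "How long to hold before re-reading the thread." },
    },
    required: ["seconds"],
  },
};

// Hypothetical handler: clamp the requested hold and tell the loop to yield.
function runWaitForMore(input, { defaultWaitMs = 5000, maxWaitMs = 10000 } = {}) {
  const requested = Number(input && input.seconds) * 1000;
  const waitMs = Number.isFinite(requested)
    ? Math.min(maxWaitMs, Math.max(0, requested))
    : defaultWaitMs;
  return { yieldTurn: true, waitMs };
}
```

Clamping in the handler keeps the model from parking a chat longer than the hard cap, whatever number it asks for.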

E3 · stay_silent / no-op reply [option]

Let the turn legitimately produce nothing — the coach decides this fragment doesn't warrant a reply yet and yields. Requires the loop to treat empty output as intentional, not as the "no reply produced" error it logs today.

F · Context shaping — so it reads as one voice

Even with perfect timing, how we lay the burst into the prompt decides whether the reply feels like one person.

The turn-taking state machine

Most ideas above are transitions in one small machine, run per chat (D4). Naming the states makes "what happens when a message arrives right now?" answerable in every case:

State | Meaning | New inbound message →
IDLE | nothing pending | → COLLECTING; start the adaptive/typing-gated window (A2/A3)
COLLECTING | buffering a burst | append; reset window. Optionally kick speculative draft (B1)
THINKING | LLM generating | abort + restart with new context (C1), bump epoch (C4)
SPEAKING | delivering bubbles | finish current bubble, then re-collect & re-run (C2)

Window-elapsed moves COLLECTING → THINKING. Draft-ready + quiet moves THINKING → SPEAKING. Last bubble sent moves SPEAKING → IDLE. The high-water mark (D5) advances on every send.
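
The table and transitions above fit in a single pure function. In this sketch the state names come from the doc, while the event names (`inbound`, `windowElapsed`, `draftReadyAndQuiet`, `lastBubbleSent`) and the action labels are hypothetical; the caller is assumed to carry out the returned actions.

```javascript
// Turn-taking state machine sketch. Event and action names are hypothetical.
function step(state, event) {
  switch (`${state}:${event}`) {
    case "IDLE:inbound":
      return { state: "COLLECTING", actions: ["append", "startWindow"] };
    case "COLLECTING:inbound":
      return { state: "COLLECTING", actions: ["append", "resetWindow"] };
    case "COLLECTING:windowElapsed":
      return { state: "THINKING", actions: ["startTurn"] };
    case "THINKING:inbound":
      return { state: "THINKING", actions: ["append", "abortTurn", "bumpEpoch", "startTurn"] };
    case "THINKING:draftReadyAndQuiet":
      return { state: "SPEAKING", actions: ["deliver"] };
    case "SPEAKING:inbound":
      return { state: "COLLECTING", actions: ["finishCurrentBubble", "append", "startWindow"] };
    case "SPEAKING:lastBubbleSent":
      return { state: "IDLE", actions: [] };
    default:
      return { state, actions: [] }; // event not meaningful in this state: ignore
  }
}
```

Keeping the function pure means every "what happens if a message arrives right now?" question can be answered by calling `step` with the current state and `"inbound"`.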

Recommended stack

Layered cheapest-first; each layer stands alone and composes with the next.

The short version. Watch the typing indicator, vary the window by message completeness, and generate speculatively so the wait overlaps thinking — guarded by a per-chat epoch so only the newest read ever speaks. Steps 1–2 alone retire the complaint; 3–4 make it feel instant.

Config knobs

What becomes tunable as these land (extends today's single SETTLE_MS):

Knob | Controls | Rough start
WINDOW_COMPLETE_MS | window when text looks finished (A2) | 700–1000ms
WINDOW_PARTIAL_MS | window when text looks mid-thought (A2) | 5–8s
WINDOW_HARD_CAP_MS | fire-no-matter-what ceiling (A4) | 10s
TYPING_GRACE_MS | quiet needed after "…" stops (A3) | 600–1200ms
SPECULATE | generate during the window (B1) | on
MAX_SPEC_RESTARTS | abort/restart cap per burst (B1/C1) | 4–6
BARGEIN_RECHECK | peek buffer between bubbles (C2) | on
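
One way these knobs could be loaded, assuming they are read from environment-style string values (the doc does not say where SETTLE_MS lives today). `loadConfig` is a hypothetical helper and the defaults are simply values picked from inside the "rough start" ranges above.

```javascript
// Config loader sketch. Pass process.env (or any plain object of strings).
function loadConfig(env = {}) {
  // Parse a numeric knob, falling back when it is missing or not a number.
  const num = (key, fallback) => {
    const raw = env[key];
    const parsed = Number(raw);
    return raw !== undefined && raw !== "" && Number.isFinite(parsed) ? parsed : fallback;
  };
  // Parse an on/off knob; "0", "false", "off", and "" count as off.
  const flag = (key, fallback) =>
    env[key] === undefined
      ? fallback
      : !["0", "false", "off", ""].includes(String(env[key]).toLowerCase());

  return {
    WINDOW_COMPLETE_MS: num("WINDOW_COMPLETE_MS", 800),
    WINDOW_PARTIAL_MS: num("WINDOW_PARTIAL_MS", 6000),
    WINDOW_HARD_CAP_MS: num("WINDOW_HARD_CAP_MS", 10000),
    TYPING_GRACE_MS: num("TYPING_GRACE_MS", 900),
    SPECULATE: flag("SPECULATE", true),
    MAX_SPEC_RESTARTS: num("MAX_SPEC_RESTARTS", 5),
    BARGEIN_RECHECK: flag("BARGEIN_RECHECK", true),
  };
}
```

Calling `loadConfig(process.env)` at startup would keep every threshold in one place, which makes it easier to tune the windows against real bursts.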

Living doc. Families A–F are orthogonal — mix per the latency/coherence/cost you're short on. See imessage/chat-poller.mjs (timing, queueing) and src/agent.mjs (generation, barge-in, delivery) for where each lands.