One human is texting. There should be one coach reading the thread and replying once — not N parallel agents racing each other. Every idea we have for getting there.
iMessage is bursty. A human rarely sends one tidy paragraph; they fire a stutter of bubbles as the thought assembles.
The naive transport — one inbound row → one agent turn → one reply — turns that into four overlapping turns. Four LLM calls, four replies, each one blind to the bubbles that came after it. The coach answers "wait" before "pizza" has even arrived. That's the multi-thread failure: the user sees a stranger answering each fragment, not one person reading the message.
We want the opposite: collapse the burst into one coherent read, and reply once — but without making a person who sends a single quick "what should I eat?" sit through a fixed dead-air delay before the coach reacts. Those two goals pull against each other. That tension is the whole design space below.
- In chat-poller.mjs, each chat has a settle buffer. A new row resets a settleMs (currently 8s) quiet-window timer; only after the window goes silent do the buffered rows fire as ONE onMessage, text joined by newlines. This is a fixed debounce.
- The inflight map chains turns per chat so two batches never run in parallel. Good, but a batch that lands mid-turn just waits its turn; it can't fold into the running one or interrupt it.
- runAgent replays user.history, so the model can already see prior bubbles. The problem is purely about when we decide the user is done and how late input is handled.
- deliverBubbles already sends one reply as several bubbles with typing cadence, so a single batched turn still looks like natural texting. We don't need many turns to get many bubbles.

The one knob in production is the fixed 8s settle. It's the "annoying delay." Everything below is about beating a fixed delay.
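For reference, the fixed debounce described above can be sketched in a few lines. This is a minimal illustration, not the code in chat-poller.mjs: `makeSettleBuffer` and its injectable `setTimer` / `clearTimer` parameters are hypothetical names, added so the behavior can be exercised without real waiting.

```javascript
// Fixed-debounce settle buffer, one entry per chat.
// Assumption: rows arrive as plain strings keyed by a chat id.
function makeSettleBuffer({ settleMs, onMessage, setTimer = setTimeout, clearTimer = clearTimeout }) {
  const pending = new Map(); // chatId -> { rows: string[], timer }

  return function onRow(chatId, text) {
    let entry = pending.get(chatId);
    if (!entry) {
      entry = { rows: [], timer: null };
      pending.set(chatId, entry);
    }
    entry.rows.push(text);

    // Every new row restarts the quiet window.
    if (entry.timer !== null) clearTimer(entry.timer);
    entry.timer = setTimer(() => {
      pending.delete(chatId);
      // The whole burst fires as ONE message, joined by newlines.
      onMessage(chatId, entry.rows.join("\n"));
    }, settleMs);
  };
}
```

The cost of this shape is visible in the code: nothing can fire sooner than `settleMs` after the last row, no matter how finished the message looks.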
Every idea is a choice on three axes. Naming them keeps the catalog honest:
| Axis | Question | Cheap end ↔ expensive end |
|---|---|---|
| Latency | How long until the coach reacts to a finished thought? | instant ↔ fixed dead-air |
| Coherence | Does one reply address the whole burst? | per-fragment ↔ whole-thread |
| Cost | Tokens / compute burned per human turn? | one call ↔ speculative re-runs |
A fixed debounce trades latency for coherence at zero extra cost. Speculation buys latency back by spending cost. Model-in-the-loop buys accuracy by spending a small call. Pick based on whichever axis you're shortest on.
Smart timing needs evidence that the user is (or isn't) done. What's actually available to us:
| Signal | Source | Tells us |
|---|---|---|
| Typing indicator | chat.db / helper stream (the "…" bubble) | The strongest signal. If they're typing, they're not done. Fire the instant typing stops, not on a blind timer. |
| Terminal punctuation | message text | Ends with ? . ! → likely complete. Ends bare, or on and / so / i / , → likely mid-thought. |
| Inter-bubble gap | row timestamps | Tight gaps (<2s) = a burst in progress. A long gap = thought finished. |
| Message shape | text length / form | A lone "?" or "ok" is complete-and-short → fire fast. A trailing fragment wants a wait. |
| Question vs statement | content | A direct question expects an immediate answer; rambling context can tolerate a beat. |
| Attachment rows | has_attachments | A photo often lands as its own row, with the caption a separate bubble. Almost always wait for the pair. |
| Read receipts / our own state | helper | Whether we've already marked read / started typing back — affects whether interrupting is graceful. |
All of these decide the single question: has the user stopped talking? They live in the poller's settle logic.
What we run now. Reset an N-second timer on each row; fire on silence. Dead simple, robust, zero cost.
scheduleTurn / settleMs.

Same machinery, but the window length is a function of the last bubble. Complete-looking text → short window (e.g. 800ms). Mid-thought text → long window (e.g. 6–8s). Replace the constant settleMs with windowFor(rows).
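A rough sketch of what windowFor(rows) could look like, using only the punctuation and trailing-word signals from the table. The two constants are picked from inside the ranges suggested in this doc, and the "lone short token counts as complete" rule (three characters or fewer) is an assumption made for illustration.

```javascript
// Assumed values, chosen from the ranges suggested in this doc.
const WINDOW_COMPLETE_MS = 800;
const WINDOW_PARTIAL_MS = 7000;

// Trailing words that usually mean the sender is still mid-thought.
const TRAILING_CONNECTIVES = new Set(["and", "so", "i"]);

// Returns "complete" or "partial" for a single bubble of text.
function completeness(text) {
  const trimmed = text.trim();
  if (trimmed === "") return "partial";
  // Terminal punctuation: likely a finished thought.
  if (/[?.!]$/.test(trimmed)) return "complete";
  if (trimmed.endsWith(",")) return "partial";
  const words = trimmed.split(/\s+/);
  const lastWord = words[words.length - 1].toLowerCase();
  if (TRAILING_CONNECTIVES.has(lastWord)) return "partial";
  // Assumption: a lone short token such as "ok" is complete-and-short.
  if (words.length === 1 && trimmed.length <= 3) return "complete";
  // Bare ending with no other evidence: treat as mid-thought.
  return "partial";
}

// Window length is a function of the last bubble in the buffer.
// Assumption: rows is an array of message strings.
function windowFor(rows) {
  const last = rows.length > 0 ? rows[rows.length - 1] : "";
  return completeness(last) === "complete" ? WINDOW_COMPLETE_MS : WINDOW_PARTIAL_MS;
}
```

Defaulting to the long window on ambiguous text keeps the failure mode the same as today's fixed delay, so a weak heuristic can only make things faster, not less coherent.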
completeness(text) heuristic feeding a variable timeout in scheduleTurn.

Subscribe to the typing indicator. Rule: never fire while the "…" is showing. When typing stops, start a short grace timer (~600–1200ms, covers the gap before the next bubble) and fire if it stays quiet. No typing seen at all → treat as done and fire on a short window.
Two timers: a soft one that fires early when signals say "complete," and a hard ceiling (e.g. 10s) that fires no matter what, so we never hang on a flaky signal. A2/A3 ride on top of this.
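The typing gate and the hard ceiling combine into one small decision function. The sketch below is written as a pure function of a per-chat state object and the current time so it is easy to reason about. The state field names are hypothetical, and TYPING_GRACE_MS is an assumed value inside the range given above.

```javascript
const TYPING_GRACE_MS = 900;      // assumption: within the 600 to 1200 ms range
const WINDOW_HARD_CAP_MS = 10000; // the hard-ceiling example from this doc

// state (hypothetical shape):
//   firstRowAt      ms timestamp of the first buffered row
//   lastRowAt       ms timestamp of the newest buffered row
//   typing          true while the "…" bubble is showing
//   typingStoppedAt ms timestamp typing last stopped, or null if never seen
//   softWindowMs    the soft window, e.g. from an adaptive heuristic
function shouldFire(state, now) {
  // Hard ceiling: fire no matter what, so a stuck signal cannot hang the turn.
  if (now - state.firstRowAt >= WINDOW_HARD_CAP_MS) {
    return { fire: true, reason: "hard-cap" };
  }
  // Never fire while the user is visibly typing.
  if (state.typing) return { fire: false, reason: "typing" };
  // Typing was seen and has stopped: wait out the short grace period.
  if (state.typingStoppedAt != null) {
    const quietSince = Math.max(state.typingStoppedAt, state.lastRowAt);
    return now - quietSince >= TYPING_GRACE_MS
      ? { fire: true, reason: "typing-grace" }
      : { fire: false, reason: "grace-pending" };
  }
  // No typing signal at all: fall back to the soft window.
  return now - state.lastRowAt >= state.softWindowMs
    ? { fire: true, reason: "soft-window" }
    : { fire: false, reason: "window-pending" };
}
```

Checking the hard cap first is the important ordering: it is the only branch that does not depend on a signal that might be flaky.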
Track how a given handle texts — burst-of-many vs one-long-paragraph, typical inter-bubble gap — and tune their window over time. Stored per-user alongside the profile.
Tighten the window during an active back-and-forth (recent turns, fast replies); loosen it when the chat's been cold. A cheap multiplier on the base window.
Different bet: don't wait then generate — generate during the wait and throw the draft away if the user keeps talking. Trades tokens for latency.
The moment a bubble lands, kick off the agent turn speculatively against the buffer-so-far. If another bubble arrives, abort and restart with the bigger context. When the window finally closes, the draft is usually already done → send instantly, zero perceived latency.
AbortController on createMessage; restart on each new row; settle-fire just delivers the ready draft.

Variant: let the speculative draft run to completion, but hold delivery behind the quiet check. If new input arrived during generation, discard and re-run (this is B1 ∪ barge-in C2). If not, ship it. The "look before you speak" gate.
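A minimal sketch of the speculative draft with abort-and-restart, plus the hold-until-quiet gate. `generate(rows, signal)` is a hypothetical stand-in for the real agent turn: it is assumed to return a promise and to stop work when the signal aborts. `makeSpeculator` and `takeDraft` are illustration names, not existing functions.

```javascript
// Assumption: generate(rows, signal) returns a promise of the reply text
// and honors the AbortSignal.
function makeSpeculator(generate) {
  let rows = [];
  let controller = null;
  let draft = null; // latest finished draft, or null if none is current

  return {
    // Call on every inbound row: drop the stale draft, restart on the bigger buffer.
    onRow(text) {
      if (controller) controller.abort();
      rows = [...rows, text];
      draft = null;
      const mine = (controller = new AbortController());
      generate(rows, mine.signal).then(
        (reply) => {
          // Only the newest, un-aborted run may publish a draft.
          if (!mine.signal.aborted) draft = reply;
        },
        () => {} // aborted or failed drafts are simply discarded
      );
      return mine.signal;
    },

    // Call when the quiet window closes. Returns the ready draft, or null
    // if the caller still has to wait for the running turn.
    takeDraft() {
      const ready = draft;
      if (ready !== null) {
        rows = [];
        draft = null;
        controller = null;
      }
      return ready;
    },
  };
}
```

Delivery only ever happens through `takeDraft`, so a draft that finishes early still waits behind the quiet check. A restart cap would slot into `onRow` to bound the wasted calls on a long burst.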
Regenerate a fresh draft on every row, always have a current answer ready. Lowest latency, highest waste. Only sane if turns are cheap or the model is tiny. Listed for completeness.
The half a debounce can't touch. The user spoke while we were thinking, or while bubbles were still landing on their phone. What do we do with the half-built reply?
New input mid-generation → abort the in-flight LLM call, discard, restart with the new message folded in. Cleanest correctness: whatever we say will have seen everything. Pairs naturally with B1 (same abort path).
AbortController threaded through createMessage in runAgent.

Let the turn finish, but in deliverBubbles check the inbound buffer before each bubble. If new input landed, stop, discard remaining bubbles, and re-run with the new message + what was already sent. The delivery loop already pauses between bubbles, which gives natural interrupt points.
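The between-bubbles check might look like the loop below. This is a sketch under assumptions, not the existing deliverBubbles: `send` and `hasNewInput` are hypothetical hooks (deliver one bubble, peek at this chat's inbound buffer), and `pauseMs` stands in for the typing cadence.

```javascript
// Assumption: send(bubble) returns a promise; hasNewInput() is a cheap
// synchronous peek at the inbound buffer.
async function deliverWithBargeIn(bubbles, { send, hasNewInput, pauseMs = 0 }) {
  const sent = [];
  for (const bubble of bubbles) {
    // The gap before each bubble is the natural interrupt point.
    if (hasNewInput()) {
      return { interrupted: true, sent, dropped: bubbles.slice(sent.length) };
    }
    await send(bubble);
    sent.push(bubble);
    if (pauseMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return { interrupted: false, sent, dropped: [] };
}
```

Returning `sent` matters: the re-run needs to know what the user has already seen so the next reply continues the thought instead of repeating it.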
deliverBubbles loop.

The "pass the new message in while it's generating" idea. With the standard request/response API you can't literally inject into a running call, so "injection" reduces to C1 (abort + re-issue with the message appended). Real mid-stream injection needs a streaming/realtime session that accepts new context on an open turn. Powerful, heaviest lift; revisit if we move to a realtime transport.
Stamp every turn with a monotonic epoch per chat. New input bumps the epoch. Any turn that finishes with a stale epoch discards its output instead of sending. The guardrail that makes C1/C2 safe under races — guarantees the newest read always wins and stale replies die quietly.
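The epoch guard is small enough to sketch in full. `makeEpochGuard` and its method names are hypothetical; the idea is only that a turn records the epoch when it starts and re-checks it immediately before sending.

```javascript
// Per-chat monotonic epoch. Assumption: chats are keyed by a string id.
function makeEpochGuard() {
  const epochs = new Map(); // chatId -> current epoch
  const current = (chatId) => (epochs.has(chatId) ? epochs.get(chatId) : 0);

  return {
    // New inbound input: every turn stamped before this call is now stale.
    bump(chatId) {
      const next = current(chatId) + 1;
      epochs.set(chatId, next);
      return next;
    },
    // Record the epoch when a turn starts.
    stamp(chatId) {
      return current(chatId);
    },
    // Check right before sending; a stale turn must discard its output.
    isCurrent(chatId, stamped) {
      return stamped === current(chatId);
    },
  };
}
```

Usage is two lines at the edges of a turn: `const e = guard.stamp(chatId)` before the LLM call, and `if (!guard.isCurrent(chatId, e)) return;` before anything is sent.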
helper.send → double reply.

Structural guarantees so "one reader" is enforced by the architecture, not by luck.
inflight map already serializes turns per chat. Keep it: it's the floor. Nothing else works without it.

Push the endpointing judgment into intelligence instead of heuristics. The model literally reads the chat and decides whether it's its turn.
Before the expensive coach turn, a fast small-model (Haiku) call: given the last few bubbles + typing state, return {done, suggestedWaitMs}. Separates the cheap timing decision from the expensive content decision. Smarter than regex, ~nothing in latency/cost.
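Whatever model makes the timing call, its output should be treated as untrusted before it drives a timer. The sketch below covers only that step: parsing a raw text reply into a safe {done, suggestedWaitMs} verdict. `parseEndpointVerdict` and both default values are assumptions; the model call itself is left out.

```javascript
// Parse the small model's raw reply into a safe verdict. Anything malformed
// falls back to "not done, wait the default", so a bad reply can delay a
// turn but never fire one early.
// Assumption: the model was asked to answer with a JSON object.
function parseEndpointVerdict(raw, { defaultWaitMs = 2000, maxWaitMs = 10000 } = {}) {
  const fallback = { done: false, suggestedWaitMs: defaultWaitMs };
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return fallback;
  }
  if (parsed === null || typeof parsed !== "object" || typeof parsed.done !== "boolean") {
    return fallback;
  }
  if (parsed.done) return { done: true, suggestedWaitMs: 0 };
  const wait = Number.isFinite(parsed.suggestedWaitMs) ? parsed.suggestedWaitMs : defaultWaitMs;
  // Clamp so the model can never ask for a negative or unbounded hold.
  return { done: false, suggestedWaitMs: Math.min(Math.max(wait, 0), maxWaitMs) };
}
```

Clamping to a maximum keeps the model's suggestion underneath the same hard ceiling that guards the heuristic timers.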
handleInbound ahead of runAgent.

wait_for_more tool

Give the coach a tool it can call instead of replying: "they seem mid-thought, hold ~Ns." The model reads the thread and chooses respond-now vs wait. Endpointing becomes part of the agent's reasoning, not a separate system.
TOOLS / runTool; turn yields without sending.

stay_silent / no-op reply

Let the turn legitimately produce nothing: the coach decides this fragment doesn't warrant a reply yet and yields. Requires the loop to treat empty output as intentional, not as the "no reply produced" error it logs today.
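A sketch of how the two "choose not to speak" paths could be declared. It assumes TOOLS uses the JSON-schema tool shape of the Anthropic Messages API (name, description, input_schema); if the project declares tools differently, only the wrapper changes. The handler name, the result shape, and MAX_HOLD_MS are all hypothetical.

```javascript
// Assumption: tools are declared as { name, description, input_schema }.
const WAIT_FOR_MORE = {
  name: "wait_for_more",
  description:
    "Call instead of replying when the user seems mid-thought. " +
    "The turn ends without sending anything.",
  input_schema: {
    type: "object",
    properties: {
      seconds: {
        type: "number",
        description: "Roughly how long to hold before re-reading the thread.",
      },
    },
    required: ["seconds"],
  },
};

const MAX_HOLD_MS = 10000; // assumption: reuse the hard-ceiling example

// Hypothetical handler: returns a control result rather than tool output,
// telling the agent loop to yield silently and re-arm the window.
function runWaitForMore(input) {
  const seconds = input && Number.isFinite(input.seconds) ? input.seconds : 0;
  const holdMs = Math.min(Math.max(Math.round(seconds * 1000), 0), MAX_HOLD_MS);
  return { yieldTurn: true, send: false, holdMs };
}

// An intentionally empty turn can share the same result shape, which lets
// the loop tell "chose silence" apart from "failed to produce a reply".
const STAY_SILENT = { yieldTurn: true, send: false, holdMs: 0 };
```

Giving silence an explicit result value is the design point: the loop can keep logging a genuinely empty reply as an error while treating these two outcomes as normal.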
Even with perfect timing, how we lay the burst into the prompt decides whether the reply feels like one person.
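One concrete way to lay a burst into the prompt is to merge consecutive user messages into a single user turn before the model sees them. A minimal sketch, assuming messages are `{ role, content }` objects with string content; `collapseUserTurns` is a hypothetical helper name.

```javascript
// Merge runs of adjacent user messages into one user message.
// Assumption: content is a plain string on every message.
function collapseUserTurns(messages) {
  const out = [];
  for (const msg of messages) {
    const prev = out[out.length - 1];
    if (prev && prev.role === "user" && msg.role === "user") {
      // Fold this bubble into the previous user turn.
      prev.content = `${prev.content}\n${msg.content}`;
    } else {
      // Copy so the caller's history is not mutated.
      out.push({ ...msg });
    }
  }
  return out;
}
```

Joining with newlines keeps the bubble boundaries visible to the model without presenting each one as a separate turn that invites its own answer.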
- messages in runAgent, collapse adjacent user turns into one user message. The model sees "what they said," not a stuttered transcript it might answer line-by-line.
- toBubbles): one turn can still look like 3 natural texts, so batching never costs us the texty feel.
- helper.send, suppress the duplicate (pairs with the D-axis epoch).

Most ideas above are transitions in one small machine, run per chat (D4). Naming the states makes "what happens when a message arrives right now?" answerable in every case:
| State | Meaning | New inbound message → |
|---|---|---|
| IDLE | nothing pending | → COLLECTING; start the adaptive/typing-gated window (A2/A3) |
| COLLECTING | buffering a burst | append; reset window. Optionally kick speculative draft (B1) |
| THINKING | LLM generating | abort + restart with new context (C1), bump epoch (C4) |
| SPEAKING | delivering bubbles | finish current bubble, then re-collect & re-run (C2) |
Window-elapsed moves COLLECTING → THINKING. Draft-ready + quiet moves THINKING → SPEAKING. Last bubble sent moves SPEAKING → IDLE. The high-water mark (D5) advances on every send.
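The table and transitions above translate almost directly into a lookup. The sketch below is one possible encoding: the event names and action names are hypothetical labels for work the poller would do, and the optional speculative draft in COLLECTING is left out for brevity.

```javascript
// state -> event -> { next, actions }. Action names are illustrative only.
const TRANSITIONS = {
  IDLE: {
    inbound: { next: "COLLECTING", actions: ["append", "startWindow"] },
  },
  COLLECTING: {
    inbound: { next: "COLLECTING", actions: ["append", "resetWindow"] },
    windowElapsed: { next: "THINKING", actions: ["startTurn"] },
  },
  THINKING: {
    // Abort + restart with the new context, and bump the epoch.
    inbound: { next: "THINKING", actions: ["append", "bumpEpoch", "abortTurn", "startTurn"] },
    draftReadyAndQuiet: { next: "SPEAKING", actions: ["deliverBubbles"] },
  },
  SPEAKING: {
    // Finish the current bubble, drop the rest, then re-collect and re-run.
    inbound: {
      next: "COLLECTING",
      actions: ["finishCurrentBubble", "dropRemainingBubbles", "append", "startWindow"],
    },
    lastBubbleSent: { next: "IDLE", actions: [] },
  },
};

// Pure transition function: events that make no sense in the current state
// are ignored rather than treated as errors.
function transition(state, event) {
  const row = TRANSITIONS[state] && TRANSITIONS[state][event];
  if (!row) return { next: state, actions: [] };
  return { next: row.next, actions: [...row.actions] };
}
```

Keeping the machine as data means every "what if a message arrives now?" question is answered by reading one row, and a missing row is an obvious gap rather than an implicit behavior.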
Layered cheapest-first; each layer stands alone and composes with the next.
What becomes tunable as these land (extends today's single SETTLE_MS):
| Knob | Controls | Rough start |
|---|---|---|
| WINDOW_COMPLETE_MS | window when text looks finished (A2) | 700–1000ms |
| WINDOW_PARTIAL_MS | window when text looks mid-thought (A2) | 5–8s |
| WINDOW_HARD_CAP_MS | fire-no-matter-what ceiling (A4) | 10s |
| TYPING_GRACE_MS | quiet needed after "…" stops (A3) | 600–1200ms |
| SPECULATE | generate during the window (B1) | on |
| MAX_SPEC_RESTARTS | abort/restart cap per burst (B1/C1) | 4–6 |
| BARGEIN_RECHECK | peek buffer between bubbles (C2) | on |