Casual Sheets May 28, 2026

Yjs replay retry: classifying transient vs permanent failures in a CRDT bridge

When a Yjs CRDT bridge fails to apply a remote mutation, the difference between "network blip during chunk-load" and "malformed mutation params" matters a lot. We added a 50-line classifier and a 4-second retry budget. Silent divergences became user-visible recoverable state.

In a Yjs + Hocuspocus collaborative editor, the replay loop on each peer is where remote mutations become local state. Something arrives in the op log (Y.Array.observe), you dispatch cmdSvc.executeCommand(rec.id, params, { fromCollab: true }), your local state moves to match the sender’s.

Before this fix, our error handler was:

.catch((err) => {
  console.warn('[collab] replay failed for', rec.id, err);
  noteReplayFailure();  // increments a counter
});

noteReplayFailure just bumped a counter, surfaced eventually in the UI as “N edits from peers couldn’t be applied to your view — refresh to resync.” A real warning! But every replay failure looked the same to it. That conflation was the bug.

Two kinds of failure

Mutations fail for very different reasons:

Transient failures:

Webpack chunk-load failed during the lazy-plugin gate
“Loading chunk 42 failed” from the bundler
Vite’s “Failed to fetch dynamically imported module”
NetworkError on a flaky connection

These are network or bundler hiccups. Retry once the network recovers and the mutation lands cleanly. The state the mutation describes is still valid — the chunk just wasn’t ready.

Permanent failures:

Mutation params don’t match the registered command’s expected shape (e.g., params.value is undefined)
The mutation references a sheet/range that doesn’t exist locally
Unknown command id (sender ran a plugin version we don’t have)

These will fail the same way every time. Retrying just re-throws the same stack trace. Burning the retry budget here is wasted attempts that delay the dead-letter.

Pre-fix, both classes hit the same noteReplayFailure() and got the same “refresh to resync” treatment. Transients incremented a counter forever because they never got a second chance. Permanents incremented the counter forever because they couldn’t be fixed without code changes.

The 50-line classifier

export function classifyReplayError(err: unknown): 'transient' | 'permanent' {
  if (err == null) return 'permanent';

  const e = err as { name?: unknown; message?: unknown };
  const name = typeof e.name === 'string' ? e.name : '';
  const message = typeof e.message === 'string' ? e.message : String(err);

  // ChunkLoadError sets .name even though it's not on stock Error
  if (name === 'ChunkLoadError') return 'transient';

  const lower = message.toLowerCase();
  if (lower.includes('loading chunk') && lower.includes('failed')) return 'transient';
  if (lower.includes('failed to fetch dynamically imported')) return 'transient';
  if (lower.includes('networkerror when attempting to fetch')) return 'transient';
  if (lower.includes('network request failed')) return 'transient';

  return 'permanent';
}

Five rules. Conservative bias: when in doubt, return 'permanent'. A false negative costs us a dead-letter entry the user can recover from with a refresh; a false positive wastes 4 seconds of retries on a known-broken mutation. The cost asymmetry favours conservative.

The five rules cover what we’ve actually seen in the wild. Webpack 5 sets name === 'ChunkLoadError'; webpack 4 and Vite both throw Error('Loading chunk N failed') shaped messages but don’t set a distinctive name. Native ESM dynamic-import failures land as Failed to fetch dynamically imported module. The fetch wrappers add NetworkError when attempting to fetch and Network request failed. That’s the full transient set we trust today.

Anything else — TypeError from bad params, application errors from the command handler, custom Error subclasses we haven’t catalogued — is permanent until proven otherwise.

The retry schedule

TRANSIENT_RETRY_DELAYS_MS = [300, 900, 2700].

Three attempts after the initial try, totalling 3.9 seconds. Tuned for “a network flap lasting a few seconds should be invisible to the user.” Past 4 seconds, recovery via refresh becomes the right UX (the user notices the staleness).

The delays are exponential-ish but not pure geometric — 300, 900, 2700 matches the natural rhythm of “retry, wait longer, give up.” A pure geometric schedule (300, 600, 1200, 2400) would also work but the four-attempt budget felt like the right tradeoff between recovery probability and time-to-dead-letter.

The retry helper

export async function withRetry<T>(
  task: () => Promise<T>,
  delays: readonly number[],
  shouldRetry: (err: unknown, attemptsSoFar: number) => boolean = () => true,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i <= delays.length; i += 1) {
    try {
      return await task();
    } catch (err) {
      lastErr = err;
      const attempts = i + 1;
      if (i >= delays.length) break;
      if (!shouldRetry(err, attempts)) break;
      await sleep(delays[i]);
    }
  }
  throw lastErr;
}

The sleep parameter is injectable specifically so unit tests can pass a no-op and run the whole schedule in microseconds instead of 4 real seconds. The shouldRetry callback runs between attempts; it lets the caller switch to permanent-failure handling mid-sequence if a subsequent error reveals the real problem was permanent all along.

The bridge composes them:

const attempt = () =>
  (lazyGroup ? ensurePluginByName(lazyGroup) : Promise.resolve())
    .then(() => cmdSvc.executeCommand(rec.id, params, { fromCollab: true }));

void withRetry(attempt, TRANSIENT_RETRY_DELAYS_MS, (err) =>
  classifyReplayError(err) === 'transient'
).catch((err: unknown) => {
  // Final failure — dead-letter it.
  const cls = classifyReplayError(err);
  const record: ReplayFailureRecord = {
    id: rec.id,
    params: rec.p,
    lastError: err instanceof Error ? err.message : String(err),
    attempts: cls === 'transient' ? 1 + TRANSIENT_RETRY_DELAYS_MS.length : 1,
    firstFailedAt: Date.now(),
    lastFailedAt: Date.now(),
    classification: cls,
  };
  noteReplayFailure(record);  // pushes to dead-letter ring buffer
});

The dead-letter is a ring buffer (cap 20) of these ReplayFailureRecord objects. The bridge exposes getReplayDeadLetter() + subscribeReplayDeadLetter(). The UI now shows them in a click-to-expand popover on the connection-status pill:

┌────────────────────────────────────────────┐
│ Replay failures      Showing latest 3 of 5 │
├────────────────────────────────────────────┤
│ set-range-values [permanent]      12s ago  │
│   Cannot read properties of undefined…     │
├────────────────────────────────────────────┤
│ set-conditional-rule [transient]  45s ago  │
│   Loading chunk 7 failed.                  │
├────────────────────────────────────────────┤
│ set-drawing-apply [permanent]     2m ago   │
│   Error                                    │
├────────────────────────────────────────────┤
│ Refresh usually recovers. Persistent       │
│ failures: copy a sample and file an issue. │
└────────────────────────────────────────────┘

Per-record info instead of a vague counter. Permanent failures show in red, transients in amber. The colour signals which “refresh” actually means: for transient, “the retry budget exhausted but the underlying network/bundler will recover on next reload;” for permanent, “you’d need to either remove the offending document content or upgrade the editor to a version that knows the mutation.”

Why the UI matters

The dead-letter ring buffer makes silent divergences diagnosable without code access. Pre-fix, a real user hitting a real divergence saw a counter increment and a refresh recommendation; the engineer got no signal at all unless the user opened DevTools and screenshotted console warnings.

Now the engineer can ask: “open the connection pill, screenshot the failure list” — and gets a structured payload that’s enough to know whether the bug is:

A bundler-deploy issue (transient cluster, all the same chunk failing) — likely a CDN cache mismatch
A plugin-version skew (permanent cluster, all the same command id)
A data-shape regression (permanent, mixed command ids with similar error messages) — usually a recent change to a mutation handler

That triage is impossible from a counter. It takes ~5 seconds from the dead-letter snapshot.

What’s still hard

A few things this design doesn’t solve:

Dead-letters don’t replay. Once a transient failure burns through its 4-second budget, the only recovery path is “user refreshes the page.” We deliberately don’t auto-retry from the dead-letter — if it failed three times in four seconds, the issue is probably more than transient. A future feature could expose a “retry all” button on the popover, but we haven’t needed it.

The classifier is closed-set. New transient signatures from future bundlers (Bun? Rspack? esbuild’s HMR-style errors?) won’t be caught until we add their patterns. A regex-based approach would be more flexible but also more error-prone; the explicit-pattern approach makes regressions easy to spot in code review.

Permanent doesn’t always mean “give up.” A “mutation references unknown sheet id” failure today might become recoverable tomorrow if the missing sheet is created. We currently don’t re-evaluate dead-letter entries when local state changes. A future improvement could pop entries off the dead-letter when their precondition is likely to have changed (e.g., a new sheet creation re-tries any failed mutations targeting that sheet id).

Code

The classifier + retry helper are in apps/web/src/collab/replay-retry.ts of Casual Sheets. 17 unit tests cover every classifier branch, the retry path, the ring buffer’s eviction, and the pinned retry schedule. The UI surface lives in apps/web/src/shell/CollabIndicator.tsx.

The full design lived as Stream A1 + A2 of the production-readiness pipeline; both shipped in v0.2.0.

Casual Sheets is an open-source self-hosted spreadsheet built on Univer OSS + Yjs + Hocuspocus. Apache-2.0. docker run -p 3000:3000 schnsrw/casual-sheets:latest.