Casual Sheets May 28, 2026

Rate limiting and room caps for a Hocuspocus + Fastify backend

A production-readiness pass on a Yjs realtime backend: per-route rate limiting via @fastify/rate-limit, a MAX_ROOMS cap with two-pass LRU eviction, and a 503 capacity_full envelope. Three small pieces that turn "denial-of-service-by-loop" into "tight-but-bounded write surface."

A Yjs realtime backend with no rate limiting is a script’s playground. Three POST endpoints — create a room, upload a seed, upload a snapshot — and a small loop turns into thousands of rooms, exhausting your room registry, your Redis backing store, or both, in minutes.

We had this exact shape. Public WebSocket gateway, three write-side HTTP endpoints, no per-IP throttle, no upper bound on room count. The room TTL would eventually evict idle rooms, but a patient script could create rooms faster than the GC interval and fill the registry to OOM. Not theoretical — the editor’s live demo at https://sheet.schnsrw.live/ has been getting probed by the usual web-scrapers since launch.

This post is how the Casual Sheets v0.2.0 production pipeline hardened those three endpoints in two compact streams.

Stream C1 — @fastify/rate-limit, per-route

The lazy way to add rate limiting on Fastify is await app.register(rateLimit) and call it done — that applies a default to every route. Don’t do that. A noisy client throttling their own /health probes looks indistinguishable from a backend outage. You want explicit per-route opt-in.

import rateLimit from '@fastify/rate-limit';

await app.register(rateLimit, {
  global: false,                       // ← opt-in per route
  keyGenerator: (req) => req.ip,       // honour trustProxy when set
  addHeaders: {
    'x-ratelimit-limit': true,
    'x-ratelimit-remaining': true,
    'x-ratelimit-reset': true,
    'retry-after': true,
  },
});

Then on each write-side route, declare the bucket:

app.post(
  '/api/rooms',
  {
    config: { rateLimit: { max: 60, timeWindow: '1 minute' } },
  },
  handler,
);

app.post(
  '/api/rooms/:id/seed',
  {
    config: { rateLimit: { max: 12, timeWindow: '1 minute' } },
  },
  handler,
);

app.post(
  '/api/rooms/:id/snapshot',
  {
    config: { rateLimit: { max: 12, timeWindow: '1 minute' } },
  },
  handler,
);

Two buckets, sized differently:

/api/rooms (60/min): room creation is cheap server-side but the easiest abuse vector. 60/min = 1/sec average, plenty for a human + their dev tools, tight enough to throttle a bot.
/api/rooms/:id/seed + /snapshot (12/min): they take bytes into memory before persisting. Tighter bucket because each accepted request can be up to MAX_UPLOAD_MB of memory pressure.
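To make the sizing concrete, here is a minimal in-memory sketch of a per-key window counter. It only illustrates the accept/reject behaviour described above and is not how @fastify/rate-limit is implemented internally; `WindowCounter`, its `hit` method and the example IP are all hypothetical names chosen for this sketch.

```typescript
// Hypothetical illustration: a per-key fixed-window counter.
// Not the plugin's implementation; names are made up for this sketch.
interface WindowState {
  windowStart: number; // ms timestamp when the current window opened
  used: number;        // requests accepted in the current window
}

class WindowCounter {
  private states: { [key: string]: WindowState } = {};

  constructor(private max: number, private windowMs: number) {}

  // Returns true if the request is accepted, false if it should get a 429.
  hit(key: string, now: number): boolean {
    const state = this.states[key];
    if (!state || now - state.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.states[key] = { windowStart: now, used: 1 };
      return true;
    }
    if (state.used >= this.max) return false;
    state.used += 1;
    return true;
  }
}

// 60 creates per minute for one IP: the 61st inside the window is rejected.
const createLimiter = new WindowCounter(60, 60_000);
let acceptedCreates = 0;
for (let i = 0; i < 100; i++) {
  if (createLimiter.hit('203.0.113.7', i * 10)) acceptedCreates += 1;
}
console.log(acceptedCreates); // 60
```

Each key gets its own counter, which is why one noisy IP cannot spend another client's allowance.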

Read endpoints (GET /snapshot, GET /info) deliberately stay unbounded. Returning peers re-joining a room shouldn’t get throttled — that’s the path that gets hammered by every page-reload.

The bucket gives standard 429 + retry-after semantics. No custom envelope; let @fastify/rate-limit do its job and let clients use the standard headers.

Verifying the bucket actually clamps

A separate Node script (apps/server/scripts/loadtest.ts) drives the four endpoints with a configurable virtual-user (VU) count and duration. Run it against your server with rate limiting on and check that each bucket clamps at exactly the configured limit:

endpoint               count  errors   429s  p50(ms)  p95(ms)  p99(ms)
--------------------  ------  ------  -----  -------  -------  -------
POST /api/rooms         1162       0   1102      0.9      1.7      2.8
POST /seed                60       0     48      0.6      1.6      3.7
POST /snapshot            60       0     48      0.4      0.9      2.6
GET  /snapshot            60       0      0      0.3      0.7      1.6

From the harness:

/api/rooms: 1162 attempts → 1102 throttled (60 accepted, exactly matching the configured 60/min for a single IP across the 1-minute test window).
/seed + /snapshot: 60 attempts → 48 throttled (12 accepted, matching the 12/min envelope).
GET /snapshot: 60 attempts → 0 throttled (correctly NOT rate-limited).

Zero 5xx in the run. The bucket is the only pushback, exactly as designed.
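The check behind those numbers is simple subtraction: accepted = count minus 429s, compared against the configured limit. A small sketch of that check, using the figures from the table above; the `HarnessRow` shape and function names are hypothetical and not part of loadtest.ts.

```typescript
// Hypothetical row shape for this sketch; the numbers come from the table above.
interface HarnessRow {
  endpoint: string;
  count: number;        // total attempts
  throttled: number;    // responses with status 429
  limit: number | null; // configured per-minute limit, null when unlimited
}

const harnessRows: HarnessRow[] = [
  { endpoint: 'POST /api/rooms', count: 1162, throttled: 1102, limit: 60 },
  { endpoint: 'POST /seed', count: 60, throttled: 48, limit: 12 },
  { endpoint: 'POST /snapshot', count: 60, throttled: 48, limit: 12 },
  { endpoint: 'GET  /snapshot', count: 60, throttled: 0, limit: null },
];

// With zero errors, everything that was not throttled was accepted.
function acceptedCount(row: HarnessRow): number {
  return row.count - row.throttled;
}

function clampsAtLimit(row: HarnessRow): boolean {
  // Unlimited routes pass when nothing was throttled.
  if (row.limit === null) return row.throttled === 0;
  return acceptedCount(row) === row.limit;
}

for (const row of harnessRows) {
  console.log(row.endpoint, acceptedCount(row), clampsAtLimit(row));
}
```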

Stream C2 — MAX_ROOMS cap with two-pass LRU eviction

Rate-limit alone doesn’t bound room count over time. A scripted attacker rate-limited to 60 rooms/min still creates 3600 rooms/hour, 86 400 rooms/day. Without a hard cap, the room registry grows until OOM.

The cap:

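// Fall back to 256 when MAX_ROOMS is unset or not a positive integer.
// A NaN cap would make the size check below always false and disable the cap.
const parsedMaxRooms = Number(process.env.MAX_ROOMS);
const MAX_ROOMS = Number.isInteger(parsedMaxRooms) && parsedMaxRooms > 0 ? parsedMaxRooms : 256;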

create(opts = {}): string {
  if (this.rooms.size >= MAX_ROOMS) {
    const evicted = this.evictLeastRecent();
    if (!evicted) {
      throw new RoomCapacityError(MAX_ROOMS);
    }
  }
  const id = makeRoomId();
  this.rooms.set(id, /* … */);
  return id;
}

When create() would push past the cap, LRU-evict the oldest evictable room. “Evictable” here means doesn’t carry user data we’d hate to lose: no password set, no seed file uploaded, no snapshot uploaded. Pure throwaway rooms only.
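The "evictable" rule can be sketched as a single predicate. The room shape below is an assumption made for illustration (the field names `passwordHash`, `hasSeed` and `hasSnapshot` are hypothetical); only the three conditions themselves come from the description above.

```typescript
// Hypothetical room shape; field names are assumptions for this sketch.
interface RoomRecord {
  passwordHash?: string; // set once the room is password-protected
  hasSeed: boolean;      // a seed file was uploaded
  hasSnapshot: boolean;  // a snapshot was uploaded
}

// A room is evictable only if it carries nothing a user would miss.
function isEvictable(room: RoomRecord): boolean {
  return !room.passwordHash && !room.hasSeed && !room.hasSnapshot;
}

console.log(isEvictable({ hasSeed: false, hasSnapshot: false })); // true
console.log(isEvictable({ hasSeed: true, hasSnapshot: false }));  // false
```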

The two-pass design matters:

private evictLeastRecent(): string | null {
  // Pass 1: prefer idle-but-evictable
  let oldestId: string | null = null;
  let oldestIdleSince = Infinity;
  for (const [id, room] of this.rooms) {
    if (!this.isEvictable(room)) continue;
    if (room.idleSince > 0 && room.idleSince < oldestIdleSince) {
      oldestIdleSince = room.idleSince;
      oldestId = id;
    }
  }
  if (oldestId) {
    this.rooms.delete(oldestId);
    this.onEvict?.(oldestId);
    return oldestId;
  }

  // Pass 2: fall back to live-but-evictable by createdAt
  let oldestCreated = '9999-99-99';
  for (const [id, room] of this.rooms) {
    if (!this.isEvictable(room)) continue;
    if (room.createdAt < oldestCreated) {
      oldestCreated = room.createdAt;
      oldestId = id;
    }
  }
  if (oldestId) {
    this.rooms.delete(oldestId);
    this.onEvict?.(oldestId);
    return oldestId;
  }

  return null;  // every slot non-evictable → caller throws
}

Why two passes, not just “pick the oldest by createdAt”:

Each single-pass alternative fails. Picking by createdAt alone would evict a room someone is actively using while idle throwaway rooms sit untouched. Evicting only idle rooms gets defeated by a specific attack pattern: the attacker creates 256 rooms, opens a WebSocket to each (so clients = 1, no longer idle), and parks them. Now every room is “live but no data” and the registry is permanently full, so legitimate new users see 503s forever.

The two-pass design:

Pass 1 picks idle-but-evictable. Idle = WebSocket closed, idleSince > 0. Under normal usage, this is plenty of supply.
Pass 2 activates only when every evictable room has live clients. We then kill the oldest live one by createdAt.

This makes the “park sockets to lock out new users” attack unprofitable: every new attacker-created room costs them an existing attacker-created room.
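To see the parked-socket case play out, here is a simplified, self-contained model of the two passes. It is a sketch under assumptions, not the registry from rooms.ts: `TinyRegistry`, `ParkedRoom`, the array storage and the fake timestamps are all hypothetical, chosen so the run is deterministic.

```typescript
// Simplified model of the two-pass eviction; all names here are hypothetical.
interface ParkedRoom {
  id: string;
  createdAt: string;  // ISO timestamp, compared lexicographically
  idleSince: number;  // 0 while a client is connected
  evictable: boolean; // false once a password, seed or snapshot exists
}

class TinyRegistry {
  rooms: ParkedRoom[] = [];
  evicted: string[] = [];
  private seq = 0;

  constructor(private cap: number) {}

  create(live: boolean, evictable = true): string {
    if (this.rooms.length >= this.cap && !this.evictLeastRecent()) {
      throw new Error('capacity_full');
    }
    this.seq += 1;
    const id = 'room-' + this.seq;
    // A zero-padded sequence stands in for a real timestamp (fine below 100 rooms).
    const createdAt = '2026-01-01T00:00:' + ('0' + this.seq).slice(-2) + 'Z';
    this.rooms.push({ id, createdAt, idleSince: live ? 0 : this.seq, evictable });
    return id;
  }

  private evictLeastRecent(): boolean {
    const candidates = this.rooms.filter((r) => r.evictable);
    const idle = candidates.filter((r) => r.idleSince > 0);
    // Pass 1: oldest idle room. Pass 2: oldest live room by createdAt.
    const pool = idle.length > 0
      ? idle.sort((a, b) => a.idleSince - b.idleSince)
      : candidates.sort((a, b) => (a.createdAt < b.createdAt ? -1 : 1));
    if (pool.length === 0) return false;
    const victim = pool[0];
    this.rooms = this.rooms.filter((r) => r.id !== victim.id);
    this.evicted.push(victim.id);
    return true;
  }
}

const parkedRegistry = new TinyRegistry(4);
for (let i = 0; i < 4; i++) parkedRegistry.create(true); // attacker parks four live rooms
const legitId = parkedRegistry.create(true);              // a fifth create still succeeds
console.log(legitId, parkedRegistry.evicted);             // room-5 [ 'room-1' ]
```

With only an idle pass, the fifth create would have thrown; the fallback pass is what keeps the registry usable.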

The 503 envelope

When every slot is non-evictable (everyone has a password or uploaded data — rare in practice, common-enough during a real event), create() throws RoomCapacityError. The HTTP layer maps it cleanly:

catch (err) {
  if (err instanceof RoomCapacityError) {
    req.log.warn({ cap: err.cap }, 'room create rejected: capacity full');
    return reply
      .code(503)
      .header('retry-after', '60')
      .send({ error: 'capacity_full', cap: err.cap });
  }
  throw err;
}

503 + retry-after: 60 + a structured body. The client can distinguish “we hit the rate-limit bucket, wait for the retry-after the server sent” (429 path) from “the server is full, wait 60s and the operator probably needs to scale up” (503 path).
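On the client side, telling the two paths apart could look like the sketch below. The `CreateOutcome` type and both function names are hypothetical; the only facts taken from the server are the status codes, the retry-after header and the `{ error: 'capacity_full', cap }` body. The parser handles only the delta-seconds form of Retry-After, which is an assumption that matches the value sent above.

```typescript
// Hypothetical client-side types and helpers for this sketch.
type CreateOutcome =
  | { kind: 'created' }
  | { kind: 'rate_limited'; retryAfterSec: number }
  | { kind: 'capacity_full'; retryAfterSec: number; cap: number | null }
  | { kind: 'failed'; statusCode: number };

// Only the delta-seconds form of Retry-After is handled here.
function parseRetryAfter(header: string | null, fallbackSec: number): number {
  const n = header === null ? NaN : Number(header);
  return isFinite(n) && n >= 0 ? n : fallbackSec;
}

function classifyCreate(
  statusCode: number,
  retryAfter: string | null,
  body: { error?: string; cap?: number },
): CreateOutcome {
  if (statusCode >= 200 && statusCode < 300) return { kind: 'created' };
  if (statusCode === 429) {
    // Rate-limit path: back off for as long as the header says.
    return { kind: 'rate_limited', retryAfterSec: parseRetryAfter(retryAfter, 60) };
  }
  if (statusCode === 503 && body.error === 'capacity_full') {
    // Capacity path: retrying helps only once rooms free up or the cap is raised.
    return {
      kind: 'capacity_full',
      retryAfterSec: parseRetryAfter(retryAfter, 60),
      cap: typeof body.cap === 'number' ? body.cap : null,
    };
  }
  return { kind: 'failed', statusCode };
}

console.log(classifyCreate(503, '60', { error: 'capacity_full', cap: 256 }));
```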

Eviction calls a hook

The room registry’s eviction callback fires right after the room is removed from the in-memory registry, so the host can also drop persisted Y.Doc bytes in Redis:

this.rooms.delete(oldestId);
this.onEvict?.(oldestId);
return oldestId;

// In the Fastify entry:
rooms.start((evictedId) => {
  storage.delete(evictedId).catch((err) => {
    app.log.warn({ err, roomId: evictedId }, 'storage delete failed');
  });
});

Without this, the in-memory registry forgets the room but Redis keeps the bytes around for the full 7-day TTL — bloat that accumulates over months. The hook ties registry eviction to storage cleanup so the two stay in sync.

What this turns into

Three behaviours that didn’t exist before:

A single IP can’t create more than 60 rooms/minute + can’t upload more than 12 seeds/snapshots/minute. Standard 429 + retry-after.
A patient script can’t push the registry past 256 rooms — any further create either evicts an old idle-evictable room (transparent, no client-visible failure) or returns 503 with structured error semantics.
A “park sockets to fill the registry” attack costs the attacker their own rooms (pass 2 evicts the oldest live-evictable room when no idle one is available).

All three are configurable by environment variable (RATE_LIMIT_PER_MIN, UPLOAD_RATE_LIMIT_PER_MIN, MAX_ROOMS), default to safe values, and can be disabled (RATE_LIMIT_ENABLED=false) for load testing where the bucket would mask real failures.
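A sketch of how those variables might be read with fallbacks follows. The variable names are the ones listed above; the `readLimits` helper, the `LimitsConfig` shape and the choice of 60 / 12 / 256 as defaults (taken from the values shown earlier in this post) are assumptions, not the project's actual config loader. The function takes a plain object so it can be fed `process.env` or a test fixture.

```typescript
// Hypothetical config shape and loader for this sketch.
interface LimitsConfig {
  rateLimitEnabled: boolean;
  rateLimitPerMin: number;
  uploadRateLimitPerMin: number;
  maxRooms: number;
}

function readLimits(env: { [key: string]: string | undefined }): LimitsConfig {
  // Accept only positive integers; anything else falls back to the default.
  const positiveInt = (raw: string | undefined, fallback: number): number => {
    const n = Number(raw);
    return raw !== undefined && isFinite(n) && n > 0 && Math.floor(n) === n ? n : fallback;
  };
  return {
    // Enabled unless explicitly switched off.
    rateLimitEnabled: env['RATE_LIMIT_ENABLED'] !== 'false',
    rateLimitPerMin: positiveInt(env['RATE_LIMIT_PER_MIN'], 60),
    uploadRateLimitPerMin: positiveInt(env['UPLOAD_RATE_LIMIT_PER_MIN'], 12),
    maxRooms: positiveInt(env['MAX_ROOMS'], 256),
  };
}

console.log(readLimits({ MAX_ROOMS: '512' }).maxRooms); // 512
```

Rejecting non-numeric input matters here: a value that parses to NaN would otherwise silently disable the comparison it feeds.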

What’s still missing

Honest gaps:

Per-user rate limiting. Today’s bucket is per-IP. Behind a corporate NAT, 500 users share one bucket and throttle each other. The fix is keying the bucket by an authenticated user id (when present) and falling back to IP otherwise.
trustProxy isn’t enabled by default. If you run behind nginx/Caddy/Cloudflare, the bucket sees the proxy’s IP, not the client’s. You need to set Fastify’s trustProxy option for per-IP buckets to mean what you’d expect.
The room cap is per-process. If you horizontally scale to N processes, the effective cap is N × MAX_ROOMS. The room-to-process routing layer (sticky hash on room id) needs to bound its own creation rate.
Redis-side rate limiting for the cluster case — @fastify/rate-limit supports a Redis backend, but we’ve only configured the in-process version. Adding it is one config option but worth noting we don’t have it on today.
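For the first gap, the key function could look roughly like this. It is a sketch of the idea only: `KeyedRequest` and its `userId` field are hypothetical (Fastify requests have no such field by default), and how the user id gets attached depends on whatever auth layer is in place.

```typescript
// Hypothetical request shape; `userId` would be set by an auth layer.
interface KeyedRequest {
  ip: string;
  userId?: string;
}

// Prefer the authenticated user id, fall back to the client IP.
// The prefixes keep a user id from colliding with an IP-shaped string.
function rateLimitKey(req: KeyedRequest): string {
  return req.userId ? 'user:' + req.userId : 'ip:' + req.ip;
}

console.log(rateLimitKey({ ip: '198.51.100.4', userId: 'u_42' })); // user:u_42
console.log(rateLimitKey({ ip: '198.51.100.4' }));                 // ip:198.51.100.4
```

A function of this shape could be passed as the plugin's keyGenerator, so users behind one NAT stop sharing a bucket once they are signed in.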

Code

The full implementation is in apps/server/src/index.ts and apps/server/src/rooms.ts of Casual Sheets. Six unit tests pin every code path of the two-pass eviction (idle preference, live fallback, all-non-evictable throws, hook fires) at apps/server/src/rooms.unit.test.ts.

The load harness that verifies the bucket clamps at the configured limits is at apps/server/scripts/loadtest.ts; run it with pnpm --filter @sheet/server load.

Casual Sheets is an open-source self-hosted spreadsheet built on Univer OSS + Yjs + Hocuspocus. Apache-2.0. docker run -p 3000:3000 schnsrw/casual-sheets:latest.