Self-hosting — scaling
Single-process limits, Redis path, horizontal scale-out caveats (v0.2 lane).
Single-process today; multi-process is on the v0.2 roadmap with explicit caveats. Honest take below.
Where the limits are (single process)
A 256 MiB container can comfortably host:
- ~100 concurrent rooms
- ~500 concurrent users (across rooms)
- ~50 MiB workbooks open
- ~1000 mutations/min sustained without GC pauses showing up
Real-world bottlenecks before CPU:
- Y.Doc memory pressure: every active room holds its full CRDT graph in memory. Big workbooks with deep edit history inflate the doc; the Stage-6 compaction loop helps, but not for workbooks held open across a multi-day session. Plan for ~1–2 MiB resident per active room at steady state.
- WebSocket throughput: the ws library on Node handles ~10k concurrent connections on a modern x86 vCPU before event-loop latency creeps up. Below that, the limit is fan-out of mutations to room peers, which is O(connections × mutations).
- xlsx I/O: ExcelJS parsing runs in Web Workers on the client side, so the server's only role in upload/download is bytes-in, bytes-out. No real bottleneck there until you saturate the network.
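The per-room memory figure and the fan-out cost above can be turned into a back-of-the-envelope estimate. This is a minimal sketch, not part of the project: the function names, the 80% headroom factor, and the baseline figure (resident memory of the process before any room is open) are assumptions, so measure an idle container to get a real baseline.

```typescript
// Rough room-capacity estimate from the ~1-2 MiB per-room figure.
// baselineMiB is an ASSUMED input (idle process RSS), not a documented value.
function estimateMaxRooms(
  containerMiB: number,
  baselineMiB: number,
  perRoomMiB: number,
  headroom = 0.8, // assumed safety margin: plan to stay under 80% of the limit
): number {
  const usableMiB = containerMiB * headroom - baselineMiB;
  return usableMiB <= 0 ? 0 : Math.floor(usableMiB / perRoomMiB);
}

// Relay work for one room: each mutation is sent to every other peer.
function fanOutMessages(peersInRoom: number, mutations: number): number {
  return Math.max(peersInRoom - 1, 0) * mutations;
}

// Example with an assumed 60 MiB baseline and 1.5 MiB per room.
console.log(estimateMaxRooms(256, 60, 1.5));
console.log(fanOutMessages(5, 10));
```

With those assumed inputs the estimate lands close to the ~100 rooms quoted for a 256 MiB container; a heavier baseline or larger workbooks pull it down quickly.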
Vertical scale recipes
- Bigger RAM: 1–2 GiB lets you host hundreds of rooms.
- More CPU: Node is single-threaded, but ws does enough off-thread work that 2 vCPU helps the WebSocket I/O loop.
- Increase ROOM_TTL_MIN: rooms held in memory for 60 min after the last client leaves match the working set of a workgroup using the app on a typical workday.
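To see what a given ROOM_TTL_MIN would do to the current working set, one option is to apply the TTL to the room list yourself. The sketch below is illustrative only: it reuses the `id`, `clients`, and `idleMs` field names from the GET /api/rooms output, assumes `idleMs` counts time since the last client left, and the function name is hypothetical. It does not reproduce the server's actual eviction logic.

```typescript
// Shape assumed from the GET /api/rooms response fields.
interface RoomInfo {
  id: string;
  clients: number;
  idleMs: number;
}

// Returns the ids of empty rooms that have been idle for at least ttlMin minutes.
function roomsPastTtl(rooms: RoomInfo[], ttlMin: number): string[] {
  const ttlMs = ttlMin * 60000;
  return rooms
    .filter((room) => room.clients === 0 && room.idleMs >= ttlMs)
    .map((room) => room.id);
}

const sample: RoomInfo[] = [
  { id: "budget", clients: 2, idleMs: 0 },
  { id: "q3-plan", clients: 0, idleMs: 45 * 60000 },
  { id: "archive", clients: 0, idleMs: 90 * 60000 },
];
console.log(roomsPastTtl(sample, 60));
console.log(roomsPastTtl(sample, 30));
```

Comparing the two results shows the trade-off: a longer TTL keeps more idle rooms resident, a shorter one frees memory sooner at the cost of reloading a room when someone returns.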
Horizontal scale (v0.2 lane)
Multi-replica deployments work for stateless paths today — File→Open / Download, WOPI, the admin REST — but the WebSocket collab plane has open items:
- Sticky sessions: clients in the same room must land on the same replica or they don't see each other. Use a load balancer's IP-hash policy on /yjs upgrades, or pin via cookie. Caddy, nginx, and Traefik all support this; the limitation is informational, not technical.
- Cross-replica awareness backplane: Yjs awareness (peer cursors, presence) is in-memory per replica. Two users in the same room on different replicas see each other's mutations (via the Redis-persisted Y.Doc) but NOT each other's cursors. v0.2 ships a Redis pub/sub fan-out for awareness.
- Room creation race: two clients hitting POST /api/rooms simultaneously on different replicas would create two rooms with the same id. A Redis SETNX gate lands in v0.2.
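The idea behind a set-if-absent gate for the room creation race can be sketched as follows. This is not the v0.2 implementation: the `AtomicStore` interface, the `room:` key prefix, and `claimRoomId` are hypothetical names, and the in-memory store only stands in for Redis so the example runs on its own. A real gate would map `setIfAbsent` onto an atomic Redis operation shared by all replicas.

```typescript
// Minimal surface the gate needs (hypothetical). A Redis-backed version
// would implement setIfAbsent with an atomic set-if-not-exists command.
interface AtomicStore {
  setIfAbsent(key: string, value: string): Promise<boolean>;
}

// In-memory stand-in for illustration. It is only consistent within one
// process, so by itself it does NOT fix the cross-replica race.
class MemoryStore implements AtomicStore {
  private readonly keys = new Map<string, string>();

  async setIfAbsent(key: string, value: string): Promise<boolean> {
    if (this.keys.has(key)) return false;
    this.keys.set(key, value);
    return true;
  }
}

// Returns true for the one caller that wins the id; every later caller gets false
// and should pick a new id or join the existing room.
async function claimRoomId(
  store: AtomicStore,
  roomId: string,
  replicaId: string,
): Promise<boolean> {
  return store.setIfAbsent(`room:${roomId}`, replicaId);
}
```

Whichever replica gets `true` creates the room; the other can retry with a fresh id. The gate works only because the check and the write happen as one operation in the shared store.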
So: today, multi-replica is fine for read-mostly fleets where real-time collab is rare AND clients are pinned to a replica. For a “real” multi-replica deployment with collab, wait for v0.2.
Operational signals to watch
GET /health → { ok, rooms: <count> }
GET /api/rooms → [{ id, clients, idleMs, ... }, ...]
GET /api/files/_health → backend health probe
Plus the Fastify access log (pino JSON). Pipe it into your preferred log aggregator and alert on:
- responseTime p99 > 500 ms on /wopi/files/:id/contents: the storage backend is slow.
- 5xx rate > 0.1%: something's wrong.
- WebSocket disconnects above your expected baseline: a proxy timeout misconfiguration, or the LB is killing long-lived connections.
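If your aggregator cannot compute these directly, the first two alerts can be approximated from raw log lines. The sketch below assumes each completed request is one JSON line carrying `res.statusCode` and `responseTime` (in milliseconds); verify those field names against your own log output. It uses a nearest-rank p99 and does not filter by route, so per-endpoint numbers need lines pre-filtered for that path.

```typescript
// Fields this sketch reads; assumed from typical Fastify/pino output.
interface LogEntry {
  res?: { statusCode?: unknown };
  responseTime?: unknown;
}

interface AccessSummary {
  total: number;
  p99Ms: number;
  errorRate: number; // fraction of requests with status >= 500
}

function parseLine(line: string): LogEntry | undefined {
  try {
    return JSON.parse(line) as LogEntry;
  } catch {
    return undefined; // skip lines that are not JSON
  }
}

function summarizeAccessLog(lines: string[]): AccessSummary {
  const times: number[] = [];
  let errors = 0;
  for (const line of lines) {
    const entry = parseLine(line);
    const status = entry?.res?.statusCode;
    const ms = entry?.responseTime;
    if (typeof status !== "number" || typeof ms !== "number") continue;
    if (status >= 500) errors++;
    times.push(ms);
  }
  const total = times.length;
  if (total === 0) return { total: 0, p99Ms: 0, errorRate: 0 };
  times.sort((a, b) => a - b);
  // Nearest-rank percentile, computed with integer math to avoid rounding drift.
  const index = Math.ceil((99 * total) / 100) - 1;
  return { total, p99Ms: times[index], errorRate: errors / total };
}
```

An alert rule would then compare `p99Ms` against 500 and `errorRate` against 0.001 over whatever window you consider "sustained".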
Self-host limits — when to upgrade infrastructure
| Signal | Likely cause | Move |
|---|---|---|
| rooms > 200 sustained | Healthy! | Add more RAM. |
| Container RSS > 80% of limit | Big-workbook hold | Raise the memory limit; lower ROOM_TTL_MIN to evict idle rooms sooner. |
| Frequent peer “Out of sync” pills | Network instability or proxy timeout | Bump proxy read_timeout. |
| Storage saves > 1 s p50 | Slow backend | Move local → s3 or postgres co-located with the app. |
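The numeric rows of the table can be encoded as a small check for a dashboard or cron job. This is a sketch under assumptions: the `Signals` shape and function name are hypothetical, you supply the measurements yourself, and "sustained" is not modeled, so feed it averaged values rather than single samples. The "Out of sync" row is left out because it has no numeric threshold.

```typescript
// Inputs are measurements you collect yourself (hypothetical shape).
interface Signals {
  rooms: number;
  rssBytes: number;
  limitBytes: number;
  saveP50Ms: number;
}

// Maps the table's thresholds to its suggested moves.
function recommendMoves(signals: Signals): string[] {
  const moves: string[] = [];
  if (signals.rooms > 200) {
    moves.push("Add more RAM.");
  }
  if (signals.rssBytes > 0.8 * signals.limitBytes) {
    moves.push("Raise the memory limit or lower ROOM_TTL_MIN.");
  }
  if (signals.saveP50Ms > 1000) {
    moves.push("Move storage to a backend co-located with the app.");
  }
  return moves;
}

const MiB = 1024 * 1024;
console.log(
  recommendMoves({ rooms: 250, rssBytes: 230 * MiB, limitBytes: 256 * MiB, saveP50Ms: 300 }),
);
```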
Backups + DR
Covered in backups.md. Two things outside the backup story:
- The admin config JSON file holds secrets (mode 0600). It's part of /data; back it up.
- The CASUAL_JWT_SECRET env var is not in any backup. Lose it and every existing token becomes invalid. Restore it from the deployment manifest or secret manager.
Synced from docs/self-hosting/scaling.md in schnsrw/sheets. To update: edit upstream and re-run npm run sync-docs.