Design a collaborative document editor
Lead a frontend system design interview for a collaborative document editor. Cover requirements, state ownership, data fetching, rendering, accessibility, performance, testing, and rollout.
Answer Strategy
A collaborative document editor is the system design question that exposes whether you can name a merge strategy. The first answer should NOT be a component diagram. It should be: "two clients can edit the same character position; resolve concurrent edits with a CRDT (Yjs, Automerge) for offline-first or with OT (ShareDB, commonly paired with Quill's Delta format) for centralized coordination." Pick one and defend it before drawing any boxes.
Separate the layers explicitly. Local state owns the user's draft and pending ops not yet acked. The transport owns delivery, ordering, and reconnection. The merge layer owns conflict resolution. The render layer owns the actual textarea/contenteditable plus selection mapping. The hardest bug in this domain is selection drift: a remote op arrives, the doc is rebased, and the cursor jumps. Naming that bug up front earns senior signal.
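Selection drift is easier to discuss with a concrete mapping function. The following is a minimal sketch, assuming a simplified op shape; `TextOp` and `mapCursor` are illustrative names, not part of any library. It moves a caret offset through a remote op so the caret stays attached to the same character rather than the same numeric offset.

```typescript
type TextOp =
  | { type: 'insert'; index: number; text: string }
  | { type: 'delete'; index: number; length: number };

// Map a caret offset through a remote op.
function mapCursor(cursor: number, op: TextOp): number {
  if (op.type === 'insert') {
    // Inserts strictly before the caret push it right. An insert exactly at
    // the caret leaves it in place (assumption: remote text lands after the
    // local caret).
    return op.index < cursor ? cursor + op.text.length : cursor;
  }
  const end = op.index + op.length;
  if (cursor <= op.index) return cursor; // delete starts at or after the caret
  if (cursor >= end) return cursor - op.length; // delete is entirely before the caret
  return op.index; // caret was inside the deleted range: collapse to its start
}
```

In a textarea-based editor, one plausible wiring is to run the stored selection start and end through this function whenever a remote op is applied, then restore them with `setSelectionRange` after the render commits.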
Volunteer the production failures. Without Lamport clocks or vector clocks, concurrent edits have no agreed order, so replicas apply them differently and the doc flips between versions. Without rebase on receive, local pending edits double-apply when the server replays them. Without an offline queue, a flaky network drops user keystrokes. Without presence (cursors, selections, names), users overwrite each other silently. Without an undo stack scoped to the local user, Ctrl+Z undoes a colleague's edit. The reference shows the smallest plausible kernel: applyOp, rebase, and a tiny editor wired to send/receive.
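The ordering failure is the easiest one to make concrete. Below is a minimal sketch of a Lamport clock with a client-id tiebreak; `Stamp`, `createLamportClock`, and `compareStamps` are hypothetical names chosen for illustration. The point is that every replica sorts the same pair of concurrent ops the same way.

```typescript
type Stamp = { lamport: number; clientId: string };

function createLamportClock(clientId: string) {
  let counter = 0;
  return {
    // Call before sending a local op: the new stamp exceeds everything seen.
    tick(): Stamp {
      counter += 1;
      return { lamport: counter, clientId };
    },
    // Call when a remote op arrives: never fall behind a stamp we have seen.
    observe(remote: Stamp): void {
      counter = Math.max(counter, remote.lamport);
    },
  };
}

// Total order: Lamport timestamp first, client id as a stable tiebreak.
function compareStamps(a: Stamp, b: Stamp): number {
  if (a.lamport !== b.lamport) return a.lamport - b.lamport;
  if (a.clientId === b.clientId) return 0;
  return a.clientId < b.clientId ? -1 : 1;
}
```

A Lamport order is consistent with causality but cannot tell concurrent ops apart from causally ordered ones; that distinction is what a vector clock adds.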
Reference Implementation: Doc State Kernel With Apply And Rebase
A pure applyOp + rebase kernel that a component composes. Production code would replace this with a CRDT or OT library; the interview is about owning the merge contract.
import * as React from 'react';
type ClientId = string;
type DocOp =
  | { type: 'insert'; clientId: ClientId; lamport: number; index: number; text: string }
  | { type: 'delete'; clientId: ClientId; lamport: number; index: number; length: number };
type DocState = {
  text: string;
  // Highest Lamport timestamp seen from each client. Comparing timestamps
  // gives an order consistent with causality without a centralized clock.
  clocks: Record<ClientId, number>;
  // Ops applied locally but not yet acked by the server, kept so they can be
  // rebased when remote ops arrive first.
  pending: DocOp[];
};
// Pure transition. The interview point is *naming the merge strategy*.
// CRDT (e.g. Yjs) and OT (e.g. ShareDB) are both valid. This skeleton applies
// ops in arrival order and records each op's Lamport timestamp; a
// last-writer-wins rule would compare (lamport, clientId) pairs, with the
// client id as a stable tiebreak.
function applyOp(state: DocState, op: DocOp): DocState {
  const next: DocState = { ...state, clocks: { ...state.clocks } };
  next.clocks[op.clientId] = Math.max(next.clocks[op.clientId] ?? 0, op.lamport);
  if (op.type === 'insert') {
    next.text = state.text.slice(0, op.index) + op.text + state.text.slice(op.index);
  } else {
    next.text = state.text.slice(0, op.index) + state.text.slice(op.index + op.length);
  }
  return next;
}
// Rebase local pending ops over a remote op so the local view stays
// consistent. This is intentionally simplified: production code uses a
// CRDT library and lets the library do the position math.
function rebase(local: DocOp[], remote: DocOp): DocOp[] {
  return local.map((op) => {
    if (remote.type === 'insert' && op.index >= remote.index) {
      return { ...op, index: op.index + remote.text.length };
    }
    if (remote.type === 'delete' && op.index >= remote.index) {
      return { ...op, index: Math.max(remote.index, op.index - remote.length) };
    }
    return op;
  });
}
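type DocumentEditorProps = {
  initial: DocState;
  send: (op: DocOp) => void;
  receive: (handler: (op: DocOp) => void) => () => void;
  clientId: ClientId;
};
export function DocumentEditor({ initial, send, receive, clientId }: DocumentEditorProps) {
  const [state, setState] = React.useState<DocState>(initial);
  React.useEffect(() => {
    return receive((remote) => {
      setState((current) => {
        if (remote.clientId === clientId) {
          // Assumption: the server echoes our own ops back as the ack. The op
          // is already applied locally, so drop it from pending rather than
          // applying it a second time.
          const pending = current.pending.filter((op) => op.lamport !== remote.lamport);
          return { ...current, pending };
        }
        const rebased = rebase(current.pending, remote);
        return applyOp({ ...current, pending: rebased }, remote);
      });
    });
  }, [receive, clientId]);
  function localInsert(index: number, text: string) {
    // A Lamport timestamp must exceed every timestamp seen so far, not only
    // this client's own previous one.
    const lamport = Math.max(0, ...Object.values(state.clocks)) + 1;
    const op: DocOp = { type: 'insert', clientId, lamport, index, text };
    setState((current) => applyOp({ ...current, pending: [...current.pending, op] }, op));
    send(op);
  }
  return (
    <textarea
      value={state.text}
      aria-label="Document body"
      onChange={(event) => {
        // Tiny illustrative diff that only handles pure insertions: skip the
        // common prefix, then treat the extra characters as one insert op.
        // Production code would compute the full textual delta and translate
        // it to insert/delete ops.
        const next = event.target.value;
        const added = next.length - state.text.length;
        if (added > 0) {
          let start = 0;
          while (start < state.text.length && state.text[start] === next[start]) start += 1;
          localInsert(start, next.slice(start, start + added));
        }
      }}
    />
  );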
}
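One gap worth naming in a kernel like the one above: rebasing pending ops over a remote op is only half of the transform. The local text already contains the pending ops, so the incoming remote op also has to be shifted over them before it is applied. The following is a minimal sketch under stated assumptions: `Op`, `applyTo`, `shiftOver`, and `receiveRemote` are illustrative names, the remote op wins index ties, and overlapping deletes or inserts landing inside a deleted range are out of scope.

```typescript
type Op =
  | { type: 'insert'; index: number; text: string }
  | { type: 'delete'; index: number; length: number };

function applyTo(text: string, op: Op): string {
  return op.type === 'insert'
    ? text.slice(0, op.index) + op.text + text.slice(op.index)
    : text.slice(0, op.index) + text.slice(op.index + op.length);
}

// Shift `op` so it keeps its meaning once `applied` has already been applied.
// `appliedWinsTie` decides which of two inserts at the same index goes first.
function shiftOver(op: Op, applied: Op, appliedWinsTie: boolean): Op {
  if (applied.type === 'insert') {
    const tie = applied.index === op.index && (op.type === 'delete' || appliedWinsTie);
    return applied.index < op.index || tie
      ? { ...op, index: op.index + applied.text.length }
      : op;
  }
  // `applied` is a delete: pull later indices left, clamping into the gap.
  if (op.index <= applied.index) return op;
  return { ...op, index: Math.max(applied.index, op.index - applied.length) };
}

// Handle one remote op on a client whose text already includes `pending`.
function receiveRemote(localText: string, pending: Op[], remote: Op) {
  let incoming = remote;
  const rebased: Op[] = [];
  for (const local of pending) {
    // Both transforms use the pre-transform partner, so order matters here.
    rebased.push(shiftOver(local, incoming, true)); // remote wins ties
    incoming = shiftOver(incoming, local, false);
  }
  return { text: applyTo(localText, incoming), pending: rebased };
}
```

The check that matters is convergence: applying the remote op and then the rebased pending ops to the server's text should produce the same string the client now shows.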
Testing Strategy
Convert the answer into observable behavior. In a mid-senior interview, say which behaviors are covered by unit tests, interaction tests, accessibility checks, and one browser smoke path.
import { describe, it, expect } from 'vitest';
// applyOp, rebase, DocOp, and DocState come from the kernel above.
describe('applyOp', () => {
  const empty: DocState = { text: '', clocks: {}, pending: [] };
  it('inserts text at an index', () => {
    const next = applyOp(empty, { type: 'insert', clientId: 'a', lamport: 1, index: 0, text: 'Hi' });
    expect(next.text).toBe('Hi');
    expect(next.clocks.a).toBe(1);
  });
  it('deletes a range', () => {
    const seeded: DocState = { text: 'Hello world', clocks: {}, pending: [] };
    const next = applyOp(seeded, { type: 'delete', clientId: 'a', lamport: 1, index: 5, length: 6 });
    expect(next.text).toBe('Hello');
  });
});
describe('rebase', () => {
  it('shifts local indices forward when a remote insert arrives earlier', () => {
    const local: DocOp[] = [{ type: 'insert', clientId: 'a', lamport: 1, index: 5, text: 'X' }];
    const remote: DocOp = { type: 'insert', clientId: 'b', lamport: 1, index: 0, text: 'AAA' };
    const next = rebase(local, remote);
    expect(next[0].index).toBe(8);
  });
});
Interviewer Signal
Shows whether you can turn a broad product surface into a durable frontend architecture with clear contracts.
Constraints
- Spend the first five minutes on requirements and non-goals.
- Name client, server, cache, and URL state separately.
- Include accessibility, performance, and observability before the end.
Model Answer Shape
- Clarify users, scale, latency, collaboration, offline, and device constraints.
- Draw the route/component/data-flow shape before diving into component props.
- Choose explicit boundaries for API clients, cache, local state, design-system primitives, and tests.
Tradeoffs
- Generic primitives increase reuse but require stronger documentation and ownership.
- Client-side richness improves speed after load but can raise hydration and bundle costs.
- Real-time updates help freshness but complicate ordering, backpressure, and recovery.
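The ordering, backpressure, and recovery costs in the last tradeoff can be shown with a small outbox. This is a sketch only, assuming the transport exposes a synchronous `transmit` that returns `false` while disconnected and that the server acks ops in order; `Outgoing`, `createOutbox`, and `windowSize` are hypothetical names.

```typescript
type Outgoing = { id: number; payload: string };

function createOutbox(transmit: (op: Outgoing) => boolean, windowSize = 2) {
  const unacked: Outgoing[] = []; // oldest first; kept until the server acks
  let inFlight = 0; // how many ops from the head have been transmitted

  function flush(): void {
    // Backpressure: never have more than `windowSize` ops awaiting an ack.
    while (inFlight < unacked.length && inFlight < windowSize) {
      const next = unacked[inFlight];
      if (next === undefined || !transmit(next)) return; // transport down; retry later
      inFlight += 1;
    }
  }

  return {
    enqueue(op: Outgoing): void {
      unacked.push(op);
      flush();
    },
    // Only the head can be acked; out-of-order acks are ignored.
    ack(id: number): void {
      const head = unacked[0];
      if (head !== undefined && head.id === id) {
        unacked.shift();
        inFlight = Math.max(0, inFlight - 1);
      }
      flush();
    },
    // Recovery: after a reconnect, resend everything unacked, in order.
    reconnect(): void {
      inFlight = 0;
      flush();
    },
    size(): number {
      return unacked.length;
    },
  };
}
```

Because resending after a reconnect can deliver an op twice, a design like this relies on the server deduplicating by op id.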
Edge Cases
- Slow network and partial data.
- Permission changes while the user is on the page.
- Large datasets, long sessions, and stale caches.
Testing And Proof
- Contract tests for API adapters.
- Interaction tests for critical workflows.
- Performance budget and E2E scenario for the most important path.
Follow-Ups
- How would you roll this out safely to 1% of users?
- What would become a shared platform primitive after the second product adopted it?
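For the 1% rollout follow-up, one common approach is deterministic bucketing: hash a stable user id into a fixed range and compare it against the rollout percentage, so a user stays in the same cohort as the percentage grows. The sketch below assumes an FNV-1a-style hash over UTF-16 code units; `bucketOf` and `inRollout` are hypothetical names, and a real deployment would more likely lean on an existing feature-flag service.

```typescript
// Hash a user id into one of 10,000 buckets (basis points).
function bucketOf(userId: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash % 10000;
}

// `percent` is 0..100; 1 means roughly 1% of ids fall below the threshold.
function inRollout(userId: string, percent: number): boolean {
  return bucketOf(userId) < Math.round(percent * 100);
}
```

Since the bucket depends only on the id, raising the percentage from 1 to 5 keeps every user who already had the collaborative editor and adds new ones, which keeps support and telemetry comparisons stable.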