← The CineLog Journal

Building CineSync: Three Versions in Five Months

September 2025 – February 2026

The fourth problem: keeping multiple people, on multiple devices, editing the same production data — without anyone losing work.

The week the backend started shipping, the first reasonable question came in from a director:

“Can I edit the shot list on my laptop while my AD is on her iPad?”

Of course. Obviously. That’s the entire point.

It took three versions of the sync engine to get there — not because the first two were wrong, but because sync is the kind of problem you can’t fully see from a whiteboard. We couldn’t pause feature work for six months to design the perfect system upfront; we needed to keep shipping so users could try things and tell us what worked. Each version was a semi-deliberate trade-off: build the simplest thing that’s good enough for the current scope, ship it, learn from how real usage stretches it, and redesign the next version with what you now know. The work spanned about five months and reshaped the architecture three times.

v1: GraphQL for data, WebSockets for “something changed”

The proof of concept used the foundation we already had. The backend exposed a GraphQL API and we used it for everything — fetching a project, hydrating the local state, sending changes back. A separate WebSocket connection ran alongside it, but only as a notification channel. When something changed on the server, the WebSocket fired a small message identifying what had changed — “this entity in this project was updated, fetch it.” The client then re-queried that specific slice via GraphQL and re-rendered.

This was the simplest thing that could possibly work, and it did work for the first scope: keeping one user’s data consistent across their own devices — laptop, tablet, phone. It let us validate the sync mechanism end-to-end before adding the harder problem of two different people editing simultaneously.

What we hit fairly quickly was that GraphQL wasn’t the right tool for the granularity we needed. Every “something changed” event meant re-fetching a meaningful slice of the project. Composing queries to only re-fetch the changed parts was always a wrestling match. As the data model grew — more entity types, more relationships, more per-project state — the resolver layer got harder to scale to the kind of granular, transaction-by-transaction sync we knew we wanted.

The notify-then-fetch pattern is also fundamentally a round-trip pattern. Even when the notification told the client exactly which entity had changed, the actual new state still had to come back over a second request. Every change cost you “notify, fetch, apply” — three steps where granular sync wants to be one.
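The round trip is easy to see in miniature. Here is a hedged sketch of the v1 loop with an in-memory stand-in for the backend; the entity names and the `FakeServer` class are illustrative, not CineLog's real API:

```python
class FakeServer:
    """Stands in for the GraphQL backend: holds canonical entity state."""
    def __init__(self):
        self.entities = {"shot:1": {"description": "Wide establishing shot"}}

    def fetch(self, entity_id):
        # The GraphQL re-query in v1: the second request every change costs.
        return dict(self.entities[entity_id])


class NotifyThenFetchClient:
    """v1 client: the WebSocket only says *what* changed; the data itself
    still has to come back over a separate fetch."""
    def __init__(self, server):
        self.server = server
        self.local_state = {}

    def on_notification(self, entity_id):
        # Step 1: notify (this call), step 2: fetch, step 3: apply.
        self.local_state[entity_id] = self.server.fetch(entity_id)


server = FakeServer()
client = NotifyThenFetchClient(server)

# A server-side edit happens, then the WebSocket notification arrives:
server.entities["shot:1"]["description"] = "Close-up, handheld"
client.on_notification("shot:1")
print(client.local_state["shot:1"]["description"])  # Close-up, handheld
```

Even in this toy version the shape of the problem shows: the notification carries no payload, so every change pays for a second trip to the server before the client can render anything.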

v2: WebSockets all the way through

Version two was the architectural commitment. We migrated both directions — the client sending changes to the server and the server pushing changes back — entirely onto WebSockets. Transactions flow over the same persistent connection both ways. No more “notify, then fetch”: when something changes, the actual change is what comes across the wire.

This was the right model for granular sync. Each transaction was small, self-describing, and could be applied directly to a client’s local state without an extra round trip to refetch anything.
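A minimal sketch of that idea, with an illustrative transaction shape (the real wire format is not shown in this article):

```python
def apply_transaction(state, txn):
    """Apply one small, self-describing sync transaction directly to a
    client's local state: no refetch, no second round trip."""
    if txn["op"] == "update":
        state.setdefault(txn["entity"], {})[txn["field"]] = txn["value"]
    elif txn["op"] == "delete":
        state.pop(txn["entity"], None)
    return state


local = {"shot:7": {"status": "planned"}}

# What v1 needed three steps for, v2 does in one inbound message: the
# change itself is what arrives over the socket.
txn = {"op": "update", "entity": "shot:7", "field": "status", "value": "shot"}
apply_transaction(local, txn)
```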

The problem v2 introduced was the inverse of v1’s. Now the WebSocket connection was carrying everything, including operations that didn’t really belong on a real-time channel. Bulk operations — populating a shot list from a script, applying a large reorder, deleting a chain of related items — bottlenecked on the same connection that was supposed to be delivering small, instant transactions. Concurrent edits across multiple connections occasionally produced subtle ordering bugs that were hard to reproduce and harder to reason about. Authorization checks were scattered across handlers, which made it hard to enforce permissions uniformly as the access-control model grew.

The transport was right for the small case. It wasn’t right for everything.

v3: REST out, WebSocket in — with a queue in between

Version three split the path back apart, but on different lines than v1 had drawn.

Outbound — anything the client wanted the server to do — moved to dedicated REST endpoints. The endpoints didn’t apply transactions directly. They handed them to a server-side transaction queue, which processed them in order, with proper authorization checks at the role and project level, and with explicit conflict resolution rules for the cases where two transactions touched the same data.

Inbound — anything the client needed to know — kept the WebSocket channel. The server, after processing a transaction from the queue and resolving any conflicts, broadcasts the canonical result to every connected device. Clients apply the resolved state, not the requested state.
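The two paths meet at the queue. A hedged sketch of that server-side shape, assuming an illustrative role model and a stale-version conflict rule (the real ACL and resolution logic are more involved):

```python
from collections import deque


class TransactionQueue:
    """v3 server path: REST handlers enqueue, one worker drains in order,
    permissions are checked once, and the canonical result is broadcast."""
    def __init__(self):
        self.queue = deque()
        self.canonical = {}      # entity -> {"value": ..., "version": n}
        self.subscribers = []    # connected devices, as push callbacks

    def submit(self, txn):
        # What the REST endpoint does: hand off, don't apply directly.
        self.queue.append(txn)

    def drain(self):
        while self.queue:
            txn = self.queue.popleft()
            # ACL enforcement at the boundary (rejected silently here).
            if txn["role"] not in ("owner", "editor"):
                continue
            entry = self.canonical.get(txn["entity"], {"version": 0})
            if txn["base_version"] < entry["version"]:
                # Conflict rule (illustrative): a stale transaction loses;
                # the canonical state is what gets re-broadcast.
                resolved = entry
            else:
                resolved = {"value": txn["value"],
                            "version": entry["version"] + 1}
                self.canonical[txn["entity"]] = resolved
            for push in self.subscribers:   # the WebSocket broadcast
                push(txn["entity"], resolved)


q = TransactionQueue()
seen = []
q.subscribers.append(lambda entity, r: seen.append((entity, r["version"])))
q.submit({"role": "editor", "entity": "shot:3", "base_version": 0, "value": "Day 1"})
q.submit({"role": "editor", "entity": "shot:3", "base_version": 0, "value": "Day 2"})
q.drain()
# Both devices receive the same canonical answer: "Day 1" at version 1.
```

The second submission was based on the same version as the first, so it loses the race; crucially, both clients hear the same resolved state, which is what keeps every device convergent.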

This gave us three things v2 couldn’t:

  • A clear back-pressure model. Large transactions don’t block small ones. The queue absorbs the load.
  • Proper ACL enforcement at the boundary. Permissions are checked once, in one place, instead of being scattered through socket handlers.
  • A single point of truth for conflict resolution. Two clients submitting conflicting transactions go through the same queue; the server decides who wins; both clients receive the same resolved state.

The GraphQL layer didn’t survive this iteration. By the time v3 was solid, the data access patterns it enabled were redundant with what the new sync model gave us natively, and the maintenance overhead wasn’t justifying itself. We removed GraphQL outright.

Optimistic updates and the outbox pattern

The most interesting thing the v3 architecture made possible — and the part we’d quietly been working toward across all three versions — was real optimistic UI.

When you make an edit, you don’t wait for the server. The transaction is written to a local outbox — a table in the device’s database that holds “things this device has done that the server hasn’t yet confirmed.” The UI updates instantly by rendering the canonical project state overlaid with whatever’s currently in the outbox, so the change is visible immediately even though the canonical local project tables haven’t moved yet.

A background worker picks up outbox entries and posts them, one at a time, to the REST queue on the server. The server processes each transaction, resolves any conflicts against the canonical state, and broadcasts the resolved result over WebSocket to every connected device. When the broadcast comes back to the originating device, the client applies the server-resolved result to its canonical project tables and clears the corresponding outbox entry. If the server resolved things differently than the optimistic display assumed (because of a conflict, a permission denial, or another device having won the race), the UI reconciles to match the canonical state automatically.

This pattern is what makes CineLog feel fast on flaky connections. The UI never waits. The outbox keeps the truth of “what this device tried to do,” the server keeps the truth of “what actually happened,” and the gap between them is reconciled in the background. You can lose your network mid-edit; the outbox still holds your transaction; when the network comes back, it drains; the server catches up; the broadcast arrives; everything ends up consistent.
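The whole loop fits in a short sketch. In-memory dicts stand in for the device database here, and the names are illustrative:

```python
import uuid


class OutboxClient:
    """Optimistic UI with a local outbox: apply locally, queue the work,
    reconcile when the server's broadcast comes back."""
    def __init__(self, post_to_server):
        self.canonical = {}   # server-confirmed project state
        self.outbox = {}      # txn_id -> pending, unconfirmed transaction
        self.post = post_to_server

    def edit(self, entity, value):
        # 1. Optimistic: record intent locally; never wait on the network.
        txn_id = str(uuid.uuid4())
        self.outbox[txn_id] = {"entity": entity, "value": value}
        return txn_id

    def render(self):
        # The UI sees canonical state overlaid with pending outbox edits.
        view = dict(self.canonical)
        for txn in self.outbox.values():
            view[txn["entity"]] = txn["value"]
        return view

    def drain(self):
        # 2. Background worker: post pending edits, one at a time.
        for txn_id, txn in list(self.outbox.items()):
            self.post(txn_id, txn)

    def on_broadcast(self, txn_id, entity, resolved_value):
        # 3. Reconcile: the server's resolved state wins; clear the entry.
        self.canonical[entity] = resolved_value
        self.outbox.pop(txn_id, None)


sent = []
device = OutboxClient(lambda txn_id, txn: sent.append((txn_id, txn)))
tid = device.edit("shot:5", "pickup at golden hour")
device.render()            # the edit is visible before any network activity
device.drain()             # network comes back: the outbox drains
txn_id, txn = sent[0]
device.on_broadcast(txn_id, txn["entity"], txn["value"])  # server confirmed
```

Note that `drain` does not clear the outbox: an entry only disappears once the broadcast confirms it, which is what makes the pattern safe across a dropped connection.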

It’s also the architectural seed that grew into the offline-capable rebuild a couple of months later — a separate article in this series. Once you have a local outbox holding your pending changes and reconciling against the server, you’re most of the way there. The server stays the source of truth; you just extend how long the client can keep working before it has to hear back.

Where collaborative sync actually breaks

The v3 model is what we run today. The shape of it — queue, conflict resolution, broadcast — was driven by specific cases that broke under v1 and v2. The ones we kept hitting:

Reorder operations. When you drag a shot from position 3 to position 7, what you’ve really done is renumber four other shots — every shot between the old and new positions shifts by one. If you naively send “set the sequence number of shot X to 7,” and someone else simultaneously sets shot Y to 7, you collide on a unique constraint and the whole operation fails.

We solved this with fractional indexing — a clever ordering technique where the “position” isn’t an integer but a fractional value between adjacent items. Drop a shot between two existing ones and it gets a sequence value halfway between theirs. No renumbering, no collisions, and other devices apply the change without disturbing anything else.
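The core of the technique is a few lines. This sketch uses floats for clarity; production implementations typically use arbitrary-precision strings so the key space never runs out of room between neighbours:

```python
def position_between(before, after):
    """Return a sort key strictly between two neighbours.
    None means 'start of list' or 'end of list' respectively."""
    if before is None and after is None:
        return 1.0            # first item in an empty list
    if before is None:
        return after / 2      # dropped at the top
    if after is None:
        return before + 1.0   # dropped at the bottom
    return (before + after) / 2


shots = {"A": 1.0, "B": 2.0, "C": 3.0}

# Drag a shot between A and B: only the moved shot gets a new key,
# so no other row is renumbered and no unique constraint is touched.
shots["D"] = position_between(shots["A"], shots["B"])  # 1.5
ordered = sorted(shots, key=shots.get)                 # A, D, B, C
```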

Splitting and merging scenes by moving a banner. A production day banner can sit between scenes or inside one. When it sits inside a scene, that scene is split across two production days — some shots shoot on Day 1, the rest on Day 2. Move the banner up or down and the assignments cascade, including which shots of a split scene belong to which day. Now multiply that by two devices doing it at once. The system has to figure out which pieces ended up where and merge the results coherently.

The first attempt used a clever localized algorithm. It failed in subtle ways and we spent a week chasing edge cases. The eventual fix was to scan the entire list to determine the correct boundaries on every banner movement — slower in theory, robust in practice. The right algorithm is sometimes the boring one.
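The boring version is short enough to show. A sketch of the full scan, assuming an illustrative item shape where banners and shots live in one ordered list:

```python
def assign_days(items):
    """Walk the combined list top to bottom and assign every shot to the
    current production day. Recomputed from scratch on each banner move:
    slower in theory, but immune to the edge cases a localized diff hits.

    items: ordered list of {"type": "banner"|"shot", ...} dicts.
    Returns {shot_id: day}.
    """
    day = 1
    assignment = {}
    for item in items:
        if item["type"] == "banner":
            day = item["day"]
        else:
            assignment[item["id"]] = day
    return assignment


items = [
    {"type": "banner", "day": 1},
    {"type": "shot", "id": "12A"},
    {"type": "banner", "day": 2},   # banner sits inside scene 12: a split
    {"type": "shot", "id": "12B"},
]
assign_days(items)   # {"12A": 1, "12B": 2}
```

Because the scan derives every assignment from the list's current order, two devices moving banners concurrently converge as soon as they agree on the order, with no per-move bookkeeping to get wrong.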

Bulk operations with foreign-key constraints. When the shot list gets repopulated from a script, hundreds of items get deleted and recreated in one batch. The database doesn’t love this. Each individual delete has to fire before its parent disappears, or the foreign key constraint complains — except the children themselves have foreign keys to other children, and so on.

We worked around this by deferring constraint checks until the end of the transaction, so the database tolerates the intermediate state as long as everything is consistent when the transaction commits. The lesson: when your domain has natural batch operations, your storage layer needs to know about them.
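The mechanism can be shown with SQLite's `DEFERRABLE INITIALLY DEFERRED`; the article doesn't name CineLog's database, so take this as a sketch of the idea rather than the actual backend:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if asked
conn.executescript("""
    CREATE TABLE scene (id INTEGER PRIMARY KEY);
    CREATE TABLE shot (
        id INTEGER PRIMARY KEY,
        scene_id INTEGER REFERENCES scene(id) DEFERRABLE INITIALLY DEFERRED
    );
    INSERT INTO scene VALUES (1);
    INSERT INTO shot VALUES (10, 1), (11, 1);
""")

# Delete the parent *before* its children. An immediate constraint would
# fail on this statement; a deferred one is only checked at commit.
conn.execute("DELETE FROM scene WHERE id = 1")
conn.execute("DELETE FROM shot WHERE scene_id = 1")
conn.commit()   # everything is consistent by commit time, so this succeeds
```

The same clause exists in PostgreSQL with the same semantics: the intermediate state inside the transaction may violate the foreign key, as long as the state at commit does not.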

What it looks like today

Two people open the same shot list on different devices. One drags shots around. The other edits descriptions. Both see each other’s changes within a second of them happening — no flicker, no manual refresh, no merge conflict dialog. Each device’s UI reflects its own changes immediately; the outbox handles the network round-trip silently in the background. If one of them loses WiFi mid-edit, their changes queue locally and replay when the connection comes back.

The sync engine is the part of CineLog you’re least likely to notice. That’s the goal. The only time anyone thinks about sync is when it fails — and three versions later, it mostly doesn’t.

The lessons, distilled

Three things we’d keep — and one we’d push back on.

  1. Transactions are the right unit of sync. Don’t sync documents; sync the discrete operations that produce them. We got this right from v1 and never regretted it.
  2. Outbound and inbound have different shapes. Outbound is the user asking the server to do something — it tolerates latency, benefits from ordering, and needs ACL and conflict resolution at the boundary. Inbound is the server telling clients what happened — it needs to be fast, uniform, and broadcast-shaped. Trying to make one transport do both was where v2 hit its limits.
  3. Optimistic UI with a local outbox is the right pattern from day one. Don’t make the user wait on the network. Apply locally, queue the work, reconcile in the background. It costs more upfront, but every feature you build on top of it is faster as a result — and it puts you most of the way to an offline-capable architecture without ever giving up the server’s role as the source of truth.

The one we’d push back on is the instinct to say “we should have skipped straight to v3.” Maybe we could have. But the design that became v3 was shaped by everything we saw users do under v1 and v2 — the bulk operations that bottlenecked the WebSocket, the ACL cases that emerged once teams started sharing projects, the concurrent-edit races we’d never have anticipated from the comfort of a planning session. v1 and v2 weren’t false starts. They were the way we earned v3.


Next: Making a Film Production App Feel Right on Every Device — making all of this feel native on the device you happen to be holding.