We’re Sequin. We let you build apps and workflows on top of APIs like Stripe and GitHub using just SQL and Postgres. Changes in the API sync to your database and changes in your database sync to the API. So – we know a thing or two about syncing Stripe! This post is about what we learned building our real-time sync.
Last updated: 11/17/2023
Stripe's API is great, so we knew our Stripe support would have to be very fast and reliable on day one. In order to replace API reads, our sync needs to create a database that is truly a second source of truth. This means:
- We need to backfill all data: On initial sync, we load in all historical data for a given Stripe account into the target Postgres database.
- Support virtual objects: Stripe has a few "virtual" objects, like "upcoming invoices." These objects are in constant flux until they are created (eg an upcoming invoice becomes an invoice.) You have to fetch these one-off as there's no place to paginate them. They don't even have primary keys.
- Writes: It's not enough just to sync data to local tables. You need a way to update that cache along with updating Stripe. That way, you can confidently read-after-write.
Two primary sync strategies
Our Stripe sync orbits around Stripe's
events endpoint. This endpoint serves the same purpose as a replication slot on a database. It contains a list of all create/update/delete events that have happened for a given account on Stripe.
Each event contains a full payload of the affected record. We can use this event stream to effectively playback all changes to a Stripe account.
However, as you might expect, the endpoint does not contain an unbounded list of all events in a Stripe account ever. It contains data from the last 30 days.
So, this means that when our customers start up a new replica Postgres database, we need to backfill it with historical Stripe data first. Backfilling just means paginating each and every endpoint back to the beginning of the Stripe account.
What we ended up with was two distinct sync processes: Our backfill process and our events polling process.
We run the backfill process first to build up the initial database. Then we run the events polling process continuously over the lifetime of the database to keep it in sync.
Sync process: The backfill
During the backfill, we need to paginate through the full history of each endpoint on Stripe.
Given the breadth of the Stripe API, there are a few of challenges the backfill poses:
- We need to make requests to dozens of API endpoints.
- Then, for each endpoint, we have to convert the JSON response into a structure that's ready to insert into Postgres.
- Further, each response can contain several layers of nested children. Those children can be lists of children which are, in turn, paginateable.
This was a great excuse to use Elixir's Broadway. A Broadway pipeline consists of one producer and one or more workers. The producer is in charge of producing jobs. The workers consume and work those jobs, each working in parallel. Broadway gives us a few things out of the box:
- A worker queue with back-pressure.
- We can dynamically scale the number of workers in the pipeline based on the amount of work to do.
- We can easily rate limit the volume of work we process per unit of time. We tuned this to stay well below Stripe's API quota limit.
- A "message" construct with acknowledge/fail behaviors. This made things like retry logic trivial.
In our case, the queue of work the producer maintains is a list of pages to be processed. A page is the combination of an endpoint and the current cursor for that endpoint. Here's a small example:
queue = [
To configure throughput, we just instantiate Broadway with a few parameters:
options = [
rate_limiting setting is all we need to ensure we process no more than 50 pages per second. This leaves a comfy 50 requests per second left over in a customer's Stripe quota.
processors, we specify that we want up to 50 concurrent workers and that each may request one unit of work per time (in our case, a page).
So, to kick off the sync, the backfill producer's queue is seeded with all Stripe endpoints (and
nil cursors). Our workers checkout a page to work and fetch it. Each page contains up to 100 objects. Each of those objects can contain a list of paginateable children. As such, the worker's first job is to populate all objects in the page completely.
Once we have a list of "filled out" objects, we parse and insert them. We use a large JSON object which maps object types and their fields to tables and columns in Postgres. We benefit greatly from the fact that every Stripe object contains an
object field which identifies what the entity is.
Sync process: new events
After the backfill completes, it's time to switch to processing events for the indefinite lifetime of the sync. But we need a smooth hand-off between the two, otherwise we risk missing a change.
To facilitate the hand-off, before the backfill begins we make a request to
/events to grab the most recent cursor. After the backfill completes, we first catch up on all /events that occurred while we were backfilling. After those are processed, the database is up-to-date. And it's time to poll
We poll the
/events endpoint every 500ms to check to see if there’s anything new to process, continuously. This is how we can promise "sub-second" lag.
We log sync completions to a Postgres table. We use the "polymorphic embed" pattern, where each log entry contains a JSON
payload that can take one of several shapes. For example, our "Stripe backfill complete" log looks like this:
When we boot a Stripe sync process, it checks the sync logs table for the most recent completed sync for this database. Our sync manager then knows what kind of sync process we need to boot and the initial state of that process.
What about webhooks?
When one hears about a "real-time" or "evented" API integration, the first API primitive that leaps to mind is "webhooks."
- You can't go down: Senders typically retry undelivered webhooks with some exponential back-off. But the guarantees are often loose or unclear. And the last thing your system probably needs after recovering from a disaster is a deluge of backed-up webhooks to handle.
- You're counting on sender deliverability: When polling, the only real barrier between you and the latest data is a possible caching layer. With webhooks, senders will typically have some sort of queue or "outbox" that their workers work through. Queues like this are subject to back-pressure. This opens you up to your sync slowing down if your sender's queue backs up.
- A redundant system: Webhooks are not something we can rely on exclusively for a syncing operation like this, so they'll always need to be complemented by a polling system. We have to poll to backfill the database after it initializes. And we may have to poll after recovering from downtime or after fixing a bug in our webhook handling logic.
In general, I have this suspicion that a system that relies purely on webhooks to stay in sync is bound to fail. All it takes is for one webhook to get dropped, on either the receiving end or sending end. With no other backup mechanisms in place, you risk a record in your database being out of sync indefinitely.
Luckily, it turns out that with an
/events endpoint to poll, webhooks are not necessary. The trick is to just poll it frequently enough to get as close to real-time as possible! What's great is that you can use the same sync system to get a change made milliseconds ago or to catch up on all changes that happened during unexpected downtime.
With backfills and events driving your sync, the last thing you need is cache invalidation whenever you write to Stripe. This can eliminate issues with "read-after-writes," or situations where you write to Stripe but read from your cache/database before that updated data has synced from Stripe.
In a great sync, reads and writes work together to keep local state up-to-date.
We packaged all this up and more in our Stripe sync. You can even write back to Stripe using
delete via your database!