
Finding and fixing eventual consistency with Stripe events

Anthony Accomazzo
5 min read

We’re Sequin. We stream data from services like Stripe to messaging systems like Kafka and databases like Postgres. It’s the fastest way to build high-performance integrations you don’t need to worry about.

Stripe's API has some nice features, and at first glance it looked like change detection would be easy. But things weren't as easy as they seemed! (You can skip ahead to the Stripe events section if you want.)

At Sequin, the backbone of our syncing infrastructure is polling. This is because polling provides stronger consistency guarantees than webhooks.

As we've written about, when you use webhooks, you give up some control: webhooks are ephemeral. If your service is down or you mishandle a webhook you receive, you're out of luck. You're also at the whims of the webhook provider. They might drop a webhook altogether, meaning you'll never have a chance to process it.

Polling is not without its challenges, however. Besides the complexity of maintaining polling infrastructure, the hardest part about polling is cursoring or paging through a stream of events. When cursoring through API items, you need to traverse the list in such a way that you don't miss any items. (And, ideally, you don't repeat items often either, as that's inefficient.)

Cursoring is surprisingly hard, as most APIs don't make it easy to see what's changed in them.

Cursoring becomes extra hard if the API you're querying is eventually consistent. In an API, eventual consistency means that your result set is not stable – the results you get from a request can change the next time you make the same request. This adds a lot of complexity, as you have to write defensive code.

Stripe events

Stripe is one of the rare API providers that has thoughtful solutions for change detection. They have a dedicated /events endpoint where they publish most of the changes that happen in their system. Examples include an event for every time a customer is created, a subscription is updated, or a new payment goes through.

We've been happy consumers of Stripe's /events and want to see more endpoints like it across other APIs.

However, due to the demanding nature of our real-time sync, we poll the /events endpoint frequently – multiple times per second. This means we're susceptible to even the slightest eventual consistency issues. And, indeed, we found a situation with the /events endpoint.

I'll give some background on the /events endpoint, discuss the issue we encountered, then tell you how we're mitigating the issue.

Paginating events

Most Stripe objects have a created property. This property is a Unix timestamp in seconds.

As a result, many Stripe events in a given Stripe account will share the same created timestamp. For example, certain Stripe operations cause many Stripe records to be created at the same time: when a customer signs up for your service and starts a new subscription, Stripe creates a bunch of objects, like a customer and a subscription for that customer, all at once.

Normally, if we were cursoring Stripe's API with a created timestamp, this could be a problem. For example, consider this simplified HTTP query:

GET api.stripe.com/v1/events?createdAfter=${cursor}&limit=100

Using created > cursor would be a problem because we could easily skip other events created at the same timestamp. An inclusive comparison has its own problem:

GET api.stripe.com/v1/events?createdAtOrAfter=${cursor}&limit=100

Here, using created >= cursor, we could get stuck on a page where every event has the same created timestamp – there would be no way for us to move forward.
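To make the stuck-page failure mode concrete, here's a minimal sketch in Python. The `page_gte` helper and the page size are made up for illustration; Stripe's real API differs:

```python
# Sketch of the stuck-page problem with an inclusive (created >= cursor)
# comparison. The helper and page size are hypothetical, not Stripe's API.

def page_gte(events, cursor, limit):
    """Return up to `limit` events with created >= cursor, oldest first."""
    matching = [e for e in sorted(events, key=lambda e: e["created"])
                if e["created"] >= cursor]
    return matching[:limit]

# Six events share one timestamp but the page size is 5: advancing the
# cursor to the last item's `created` value makes no progress.
events = [{"id": f"evt_{i}", "created": 100} for i in range(6)]

page1 = page_gte(events, cursor=100, limit=5)
next_cursor = page1[-1]["created"]        # still 100
page2 = page_gte(events, cursor=next_cursor, limit=5)
assert page1 == page2  # stuck: the same page comes back forever
```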

Fortunately, Stripe lets us cursor by the event's ID. We can make a request to get some stream of Stripe events, like this (for brevity, I'll include just the id and created properties of each event):

GET api.stripe.com/v1/events?ending_before=evt_1MoCivKddDnm8ttlZ19ZW52C

{
  "data": [
    {
      "created": 1679422378,
      "id": "evt_1Mo9g6KddDnm8ttlWtWxdDBt"
      # ...
    },
    {
      "created": 1679422373,
      "id": "evt_1Mo9g1KddDnm8ttlitN7Jl38"
      # ...
    },
    {
      "created": 1679422371,
      "id": "evt_1Mo9fzKddDnm8ttlPDyDsUez"
      # ...
    },
    {
      "created": 1679422292,
      "id": "evt_1Mo9eiKddDnm8ttlip9sgaB5"
      # ...
    },
    {
      "created": 1679422292,
      "id": "evt_3MgXZgKddDnm8ttl09DxjuvM"
      # ...
    }
  ]
}

The list of events is returned sorted by created descending, so the most recent event in the list is on top. Assuming we're paginating through the stream from oldest → newest, to continue pagination we'd pluck the event ID at the top (evt_1Mo9g6KddDnm8ttlWtWxdDBt) and send it along as our ending_before on the next request.

One odd thing to note is that the event IDs themselves are not strictly ordered. The last event in the list begins with evt_3, which is lexicographically "greater than" the evt_1 prefix of the event above it. We'll discuss this more in a bit.
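Here's a minimal sketch of that oldest → newest loop. The in-memory `fetch_events` stands in for the real /events endpoint (which you'd call through Stripe's client libraries); like the real endpoint, it returns events newer than the cursor event, newest first:

```python
# Sketch of cursoring /events oldest -> newest with ending_before.
# `fetch_events` is a stand-in for Stripe's API, not a real client call.

EVENTS = [  # oldest -> newest, as Stripe's backend might store them
    {"id": "evt_a", "created": 1679422292},
    {"id": "evt_b", "created": 1679422371},
    {"id": "evt_c", "created": 1679422373},
    {"id": "evt_d", "created": 1679422378},
]

def fetch_events(ending_before, limit=100):
    """Return up to `limit` events newer than the cursor event, newest first."""
    idx = next(i for i, e in enumerate(EVENTS) if e["id"] == ending_before)
    newer = EVENTS[idx + 1:][:limit]
    return list(reversed(newer))  # the API returns `created` descending

cursor = "evt_a"
seen = []
while True:
    page = fetch_events(ending_before=cursor)
    if not page:
        break
    seen.extend(reversed(page))   # process oldest -> newest
    cursor = page[0]["id"]        # top of the list = most recent = next cursor

assert [e["id"] for e in seen] == ["evt_b", "evt_c", "evt_d"]
```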

Missing events

We had some customers report missing items in their sync. This kicked off an investigation. We logged every request and response to and from Stripe. We then ran audits comparing the state of our synced database over time to the state of Stripe's API.

When our audits caught a missing item in our database – say, a missing Stripe subscription – we had the full trail of evidence to determine how we got there.

Our investigation revealed: the /events endpoint is eventually consistent!

Eventually consistent /events

Here's the behavior we observed: We make a request to Stripe with some event ID, say evt_0. We get back a list of 3 events. For brevity, I'll just include the id and created properties of each event. To make the created timestamps easier to read, I've formatted them into human-readable strings:

[
    {
      "created": "12:07:00",
      "id": "evt_3"
      # ...
    },
    {
      "created": "12:05:00",
      "id": "evt_2"
    # ...
    },
    {
      "created": "12:00:00",
      "id": "evt_1"
    # ...
    }
]

Given this response, our next cursor becomes evt_3. So, we make that request and get back the following events:

[
    {
      "created": "12:07:01",
      "id": "evt_7"
      # ...
    },
    {
      "created": "12:07:01",
      "id": "evt_6"
    # ...
    }
]

The problem is, evt_3 wasn't the only event that occurred at the 12:07:00 timestamp. Two other events, evt_4 and evt_5, were not present in the first response. For some reason, when we used evt_3 to get our second response, the stream started at evt_6 – which occurred at 12:07:01, the second after the batch of events took place.

We can see this play out in our historical request/response logs. Yet when we replay the request later with evt_3, we do get back evt_4 and evt_5 in the response!
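This behavior can be modeled with a small simulation. The in-memory store is purely illustrative – we don't know Stripe's internals – but it reproduces the skip we observed:

```python
# Simulation of the skip: evt_4 and evt_5 land in the /events index *after*
# evt_6 and evt_7, even though their `created` timestamps are earlier.
# The store and helper are illustrative, not Stripe's implementation.

store = []  # events in the order the index learned about them

def fetch(ending_before):
    """Return everything indexed after the cursor event, newest-indexed first."""
    idx = next(i for i, e in enumerate(store) if e["id"] == ending_before)
    return list(reversed(store[idx + 1:]))

store += [{"id": "evt_3", "created": "12:07:00"},
          {"id": "evt_6", "created": "12:07:01"},
          {"id": "evt_7", "created": "12:07:01"}]

page = fetch(ending_before="evt_3")
assert [e["id"] for e in page] == ["evt_7", "evt_6"]  # evt_4/evt_5 invisible

# Later, the stragglers become visible -- but a poller that already advanced
# its cursor to evt_7 has skipped them for good:
store[1:1] = [{"id": "evt_4", "created": "12:07:00"},
              {"id": "evt_5", "created": "12:07:00"}]
replay = fetch(ending_before="evt_3")
assert [e["id"] for e in replay] == ["evt_7", "evt_6", "evt_5", "evt_4"]
```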

This suggests there's something eventually consistent about Stripe's /events API. If we paginate through the endpoint using Event IDs, we're subject to skipping events. And because we query Stripe's /events endpoint multiple times per second, we're especially vulnerable to this issue.

How is this happening?

We're not sure why this is happening. We've confirmed it can happen when events are created in the same second, but haven't ruled out it happening in other situations.

One theory we have: some Stripe event IDs are prefixed with evt_3xxx and others with evt_1xxx. The prefix seems to correspond to the object the event envelops. For example, events for payment_intent and charge always have an evt_3xxx ID. It's possible that these objects are generated in a separate system that has its own ID generator, which could explain the events reaching the /events endpoint out of order.

Solution

To mitigate this issue, we're changing our cursoring logic. After receiving a response, to determine our cursor for the subsequent request, we follow a simple algorithm:

  1. If the created value on the latest event is more than 5 seconds in the past, advance the cursor to that event's ID.
  2. Otherwise, do not update the cursor. Instead, use the same cursor we just used in our next request.
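In Python, that cursor-advance rule looks roughly like this (names are illustrative, not our production code):

```python
import time

CONSISTENCY_WINDOW_SECS = 5  # from the algorithm above

def next_cursor(current_cursor, page, now=None):
    """Advance the cursor only if the newest event is comfortably in the past.

    `page` is a list of events sorted by `created` descending (newest first),
    as returned by /events.
    """
    now = now if now is not None else time.time()
    if not page:
        return current_cursor
    newest = page[0]
    if now - newest["created"] > CONSISTENCY_WINDOW_SECS:
        return newest["id"]          # safe to move forward
    return current_cursor            # too fresh: re-poll with the same cursor

# The newest event is only 2 seconds old, so the cursor stays put:
page = [{"id": "evt_9", "created": 1000}, {"id": "evt_8", "created": 998}]
assert next_cursor("evt_7", page, now=1002) == "evt_7"
# Once it's more than 5 seconds old, the cursor advances:
assert next_cursor("evt_7", page, now=1006) == "evt_9"
```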

This means we'll "see" the same events over several requests. And for very busy Stripe /events endpoints, it could mean we add a few seconds of latency, as we might always be running just a tad behind the present. But the improved consistency guarantee is worth it.

Without knowing the root cause, we can't be sure how much mitigation we'll need to resolve this issue. We'll update this post after we've run this algorithm in production for a bit and had a chance to measure drift.

In general, finding out what's changed in an API is an extremely common requirement for engineering teams. Eventual consistency makes this task very difficult. When designing your API, consider how you can use strategies that will make your API consistent and predictable.