How we built our sync on HubSpot's API

Anthony Accomazzo

•

Mar 28, 2023

•

13 min read

We’re Sequin. We stream data from services like HubSpot to messaging systems like Kafka and databases like Postgres. It’s the fastest way to build high-performance integrations you don’t need to worry about. HubSpot's API presents particular challenges around detecting changes and managing rate limit. Below, we'll dig into how we did it.

HubSpot is a leading platform for marketing, sales, and customer service. For companies that use HubSpot heavily, many operational workflows originate in the platform. Critical data about their customers pass in and out of HubSpot and their core services.

Like other CRMs, HubSpot allows for a lot of customization. With custom objects, you can create tables in HubSpot that match your domain. With associations, you can connect related objects together in meaningful ways, such as linking contacts to their respective companies or deals.

This means organizations often "blur the line" between their database and their HubSpot data. Records often need to move seamlessly from HubSpot to their application – and back again.

HubSpot has an API to facilitate reading data from and writing changes to their platform.

One of the most common use cases of an API is figuring out what's changed. As a developer, you need to find out what happened in an upstream API so that you can make updates to your local data or trigger side effects.

Figuring out what's changed in HubSpot is challenging because:

There is no /events endpoint that lists changes.
There are no webhooks for custom objects.
There is limited queryability around associations.
Query support for deleted objects is incomplete.
The Search API is eventually consistent.

HubSpot is one of the sources we sync at Sequin. Detecting changes (often called change data capture) is the backbone of any sync process. So, we had to figure out how to overcome these challenges to detect changes in HubSpot to power our sync.

In this post, I'll break our strategy down. Our sync process consists of three primary parts, which I'll step through in order:

Schema introspection, where we determine the list of syncable objects from HubSpot and their schemas.
Backfilling, where we paginate over the entire HubSpot instance to grab historical data.
Syncing, where we poll HubSpot for changes.

Introspecting the schema

HubSpot has some reflection APIs you can use to list the objects and custom objects in a HubSpot instance. You can also describe the fields of those objects.

To list the schemas for all the custom objects in your instance, call GET /crm/v3/schemas. The response contains all the info you'll need to determine how to parse JSON objects for use in your code or database. It contains the full list of properties for each object and the type of each property (e.g. text or datetime):

{
  "results": [
      {
          "labels": { "singular": "Car", "plural": "Cars" },
          "requiredProperties": ["year", "vin", "model", "make"],
          "searchableProperties": ["year", "vin", "model", "make"],
          "primaryDisplayProperty": "model",
          "secondaryDisplayProperties": ["make"],
          "archived": false,
          "id": "7171718",
          "fullyQualifiedName": "p82918_car",
          "properties": [
            {
              "updatedAt": "2022-05-26T00:13:43.786Z",
              "createdAt": "2022-05-26T00:05:17.903Z",
              "name": "color",
              "label": "Color",
              "type": "string",
              "fieldType": "text",
              "description": "",
              "groupName": "car_information",
              "options": [],
              "updatedUserId": "44561081",
              "displayOrder": -1,
              "calculated": false,
              "externalOptions": false,
              "archived": false,
              "hasUniqueValue": false,
              "hidden": false,
              "modificationMetadata": {
                "archivable": true,
                "readOnlyDefinition": false,
                "readOnlyValue": false
              },
              ...
      }
  ]
}

Curiously, only custom objects are listed here, not core HubSpot objects. To get the schema for core objects like Company or Contact, you'll need to call the dedicated schema endpoints for those objects. Here's the list of those objects and endpoints:

Company @ /crm/v3/schemas/companies
Contact @ /crm/v3/schemas/contacts
Deal @ /crm/v3/schemas/deals
Line item @ /crm/v3/schemas/line_items
Product @ /crm/v3/schemas/products
Quote @ /crm/v3/schemas/quotes
Ticket @ /crm/v3/schemas/tickets

Finally, some objects don't seem to be present anywhere. For example, there is no schema endpoint for a HubSpot Pipeline. For these objects, your best bet is to determine their schema definition manually.

Backfilling

A very common workflow is backfilling or paginating over an entire dataset. Backfilling is integral to setting up a sync on top of an API – you'll need to pull in the data that was created prior to your sync going live.

We decided to use HubSpot's Search API for backfilling. This felt like the best way to paginate through the history in a stable way that would ensure we didn't miss anything.

The Search API

To use the Search API for stable pagination, you need three components:

A query filter
The list of properties you want returned
A sort parameter

You'll send them along in the JSON body of your Search API request like this:

POST /crm/v3/objects/{object_name}/search

{
  filterGroups: [ /* ... */ ],
  properties: [ /* ... */ ],
  sorts: [ /* ... */ ]
}

filterGroups

HubSpot's query filters match a specific format. At Sequin, we use a query filter that looks like this:

"filters": [
    {
      "propertyName": "createdate",
      "operator": "GTE",
      "value": {cursor}
    }
]

This essentially constructs a query that looks like this:

createdate >= {cursor}

The propertyName is the field that you want to filter on. Here, we specify the createdate field. Unfortunately, not every object has a createdate field – for many non-custom objects (but not all), this field is hs_createdate.

The operator specifies GTE or "greater than or equal." And value is the timestamp we're comparing to. We start with 0, then use the latest createdate in the response to compute our next cursor.

properties

By default, the Search API returns only a few properties on the objects it returns. You'll need to specify which properties you want back in your response. There doesn't appear to be a limit on the size of this parameter, so you can just set properties to a list of all properties on the object.

sorts

For backfills, we use a sorts parameter like this:

{
  "propertyName": "createdate",
  "direction": "ASCENDING"
}

Again, propertyName will vary based on the object – many objects instead call this property hs_createdate.

Together, these three parameters in our Search API request ensure that we're able to paginate the full table of each HubSpot object without missing any records.

This strategy alone suffers from one fatal flaw: you can get stuck. If you get back a page of HubSpot records that all have the same createdate, you can't move forward with a GTE operator – you have nothing to increment your cursor with! So, we do some extra filtering and sorting that incorporates the ID of records as well to "get through" the deadlock.

This is only usually a problem on the busiest HubSpot instances. If absolute consistency isn't an issue for you, you can also consider just using a GT operator to be safe. While you might miss objects, that operator will never get you stuck in the stream.

Data type issues in the Search API

We've seen some strange values come back from HubSpot's Search API. Once in a while, we'll get back an integer in a date field or a string in a numeric.

We're still not sure where these come from. We try to recover the original type if we can, but usually our system ends up needing to just throw the values out. They're rare enough that we haven't been able to get to the bottom of what causes them to appear.

Syncing changes

Backfilling is a one-off process that we run when a sync is first established or when a re-sync is kicked off (in the case that a customer of ours e.g. adds a new column to their synced table.)

Syncing changes is far more critical – and tricky. While our sync could just use the backfill process over and over again to keep HubSpot and Postgres in-sync, this would be wildly inefficient. Large HubSpot instances can take many hours to backfill.

Instead, we need to query HubSpot in such a way that we can efficiently see what's changed from one request to the next.

As noted in the beginning of this post, this is uniquely challenging with HubSpot. Change detection is not standardized across the API. For example, webhooks are only supported on six objects. And some changes are not possible to detect at all.

We use the Search API to detect creates and updates. We use the standard list objects endpoint for deletes, where supported. And we use a combination of endpoints to create a synthetic change stream for associations.

Querying the Search API for changes

The Search API is the primary API for finding out what's changed in HubSpot.

As mentioned in "Backfilling," we use three parameters in our search requests: properties, filterGroups, and sorts. Our pagination strategy for syncing changes is similar to backfilling: use a timestamp to page through the changes. Except we're filtering and sorting on a different property:

"filters": [
    {
      // See note below on this property
      "propertyName": "lastmodifieddate",
      "operator": "GTE",
      "value": {cursor}
    }
]

Again, the propertyName to use here is not standard across objects – for some objects, it's hs_lastmodifieddate.

With this method, our system only needs to process records that were recently created or updated. We use GTE so we don't accidentally skip records in situations where multiple records were created at the same time. (This means we'll always "see" one record twice – once in request n and again in request n+1 – but the inefficiency is worth the consistency guarantee.)

One caveat with this method is that it's subject to getting stuck in certain situations with highly active HubSpot accounts, similar to our backfill method. (See "Backfills.")

With this strategy, you can detect creates and updates for objects and custom objects. You can't detect deletes. And associations are not supported in the Search API at all.

Detecting deletes

A common limitation of APIs is that they support seeing changes in records that still exist but there's no such change detection for records that have been deleted (they're no longer present anywhere for you to "see.")

HubSpot has this limitation as well, but not for all objects. Fortunately, for standard objects, they provide a way to retrieve deletes.

To detect deletes (or "archived objects"), you need to use the standard list objects endpoint:

GET /crm/v3/objects/{object}?after={cursor}&archived=true

We can paginate this endpoint with the after cursor. We get this in the paging.next.after property from the previous response. archived=true is how we tell HubSpot to give us deleted objects.

You might ask: If we can use this endpoint for grabbing archived objects, why can't we use this to do regular change detection as well? Can't we just paginate this endpoint to see which records have been created or updated, as opposed to the Search API?

Our concern is stability. We can't confirm the sort or stability of this endpoint. When an endpoint is dynamic (things are being updated and created and deleted between requests), a limit/offset pagination strategy can make it easy for you to accidentally skip records.

We can't use the Search API for archived objects, though. And archived objects are stable, so this strategy is safe.

A huge bummer, however, is that custom objects do not support the archived flag. And as we'll see, archived associations don't have a way to be retrieved either – but nothing is standard about associations anyway.

Associations

Associations are a key feature of HubSpot. On top of associating any object to another, you can add labels to your associations to create rich networks of objects.

However, associations are the most challenging part of HubSpot's API to sync. They are not supported to the same degree as objects in the API. You can't use the search API to find out which associations have been updated. And associations don't have created_at or updated_at timestamps on them.

We're hopeful HubSpot will improve their support for accessing associations. In the meantime, in order to sync them, you're forced to do a full sweep: you have to paginate through every association on a regular interval, inserting new associations and updating existing ones as you go.

Using the list objects API to fetch associations

There are two endpoints you can use to access associations.

The first option is to make calls to the standard list objects API. You can include the associations parameter, which will have HubSpot return the associations you list for each object returned:

GET /crm/v3/objects/companies?associations=CONTACT

The second option is to use the batch associations API. This endpoint will list up to 1,000 associations per object per call:

GET /crm/v4/objects/{objectType}/{objectId}/associations/{toObjectType}

We ended up using both endpoints in combination for our sync.

We use the standard list objects API to fetch a list of 100 objects and the associations for each of those objects. There's a catch, though: HubSpot only returns 100 associations per object. That means if you have an object with a lot of associations, you have to further paginate that object. If that object has thousands of associations, that could mean tens of API requests just to retrieve all the associations for that one object!

So, we then follow-up that request with zero or more batch association requests. For each object in our first request that had more than 100 associations, we make a batch association request to try to grab the rest. That will let us see 1,000 associations for that object at a time vs just 100.

As long as the count of associations per object follows a power law (only ~10% of objects have thousands of associations), this works reasonably well.

Mitigating inefficiency

The full sweep strategy is very inefficient. Sync time increases linearly with the count of associations. And any given request contains mostly associations we've seen before, meaning we're spending a lot of network and CPU retrieving and parsing JSON that contains no new information.

We want to contain this inefficiency as much as possible. The impact would be even worse if we blindly sent associations through the rest of our pipeline. If the database also had to process every association record we saw, we'd be paying the cost in multiple areas.

So to mitigate and isolate the inefficiency, we use an in-memory cache. The cache is essentially an md5 fingerprint that tells us if we've seen an association before or not.

Deletes

Our in-memory cache is helpful for another purpose: association deletes.

HubSpot provides for no way to detect if an association has been deleted. So, we use a "mark-and-sweep" strategy in our sync.

As we're paging through associations, we keep a record of all the associations that we've seen in this sync cycle. When we reach the end, we know which associations we should keep – and which we should flush. We can then issue a delete command to the database to drop the associations we saw in the last sync cycle but not in this one.

Eventual consistency

While the Search API is powerful and does most of what we want for syncing objects and custom objects, it has one major flaw: it's eventually consistent.

As we've written about, eventual consistency can be the bane of any integration. When an API is eventually consistent, you can miss creates, updates, and deletes without knowing it. Eventual consistency issues are hard to debug and hard to mitigate.

In HubSpot's Search API, objects can arrive out-of-order. To illustrate an example, I'll use a human-readable form of HubSpot's filterGroups syntax. I'll also use human-readable timestamps without a date.

Let's say you make a request to the search endpoint at exactly 12:00:10 with a cursor from just 10 seconds before, 12:00:00:

"lastmodifieddate" ≥ 12:00:00

This should return all records in the system that have been updated since 12:00:00. What we've observed can happen is that HubSpot can return a response like this:

[
  {
    "id": "some-id-1",
    "lastmodifieddate": "12:00:07"
    // ...
  },
  {
    "id": "some-id-2",
    "lastmodifieddate": "12:00:08",
    // ...
  }
]

Seeing this response, we assume that just two records have been updated since 12:00:00. For our next cursor, we can use 12:00:08. We grab that cursor, and continue on our way.

The problem: there was actually another record that was created/updated between 12:00:00 and 12:00:08. It's just not being returned by the Search API yet.

We can confirm this a minute later by repeating our original request, again asking for records created/updated after 12:00:00:

// one minute later, run this again
"lastmodifieddate" ≥ 12:00:00

And then we'll see a record appear this time, with a lastmodifieddate prior to 12:00:08:

[
  // New record appears!
  {
    "id": "some-id-0",
    "lastmodifieddate": "12:00:06"
    // ...
  },
  {
    "id": "some-id-1",
    "lastmodifieddate": "12:00:07"
    // ...
  },
  {
    "id": "some-id-2",
    "lastmodifieddate": "12:00:08",
    // ...
  }
]

Because when we initially made this request we updated our cursor to 12:00:08 to make our next request, we will never see this update.

Objects arriving out-of-order to a stream like this is no good. It means we can't increment our cursor with confidence, as we're not sure if a given page that we've retrieved has "settled" yet.

We've reached out to HubSpot for comment on this and will update this post when we hear back. Our guess is that the eventual consistency emerges because they're streaming changes from their database to some second store for search querying (like Elastic.) And must be doing so in such a way that does not guarantee strict ordering of changes.

We're still determining how long it can take the Search API to reach consistency. The safest option is to avoid cursoring up until the present. If you run your pagination a few minutes behind, you'll significantly reduce the chances you'll miss any changes.

For our sync, we wanted to continue to propagate changes in as close to real-time as possible. But consistency is of utmost importance to us. So we run two sync processes for each object that we sync: one on the "bleeding edge" of changes and another several minutes behind to catch stragglers.

This means changes can arrive out-of-order to our customers' databases, but this doesn't seem to be an issue. It's rare that it really matters that this contact was inserted before that one. What's most important is that both contacts make it into your database.

Rate limits

One final consideration when architecting a sync on HubSpot: rate limits.

The syncing strategies we've outlined can consume a ton of rate limit. You'll need to evaluate how this consumption interacts with any other integrations you have set up for your HubSpot account.

Fortunately, OAuth apps benefit from their own dedicated rate limit. This means that our solution can interact with our customers' HubSpot instances without interfering with any other integrations they may have configured to access their HubSpot API.

If you encounter rate limit issues with your own syncing process, you can explore options such as the "API add-on," which substantially increases the number of requests available for your HubSpot account. Alternatively, you can configure your app as a private OAuth app, granting it a separate pool of requests to draw from.

Conclusion

HubSpot's API poses some unique challenges for syncing. It's frustrating that their webhook support is so limited. If webhooks were supported comprehensively, we could use that to power the real-time nature of our sync. And then we could use polling the search API as a "sweep" operation to ensure we didn't miss anything. We could run that poll way behind the present to avoid eventual consistency issues.

Further, everyone would benefit if HubSpot better supported querying associations through their API. Associations are so vital to a CRM like HubSpot – they're the R in CRM! Improved access to associations would make things far easier for the developer. And would mean syncs like ours would impose far less load on their system. Sending countless gigabytes of unnecessary JSON through both of our systems feels tragically inefficient.

Finally, the differentiated treatment between objects and custom objects adds needless friction to adoption. Both in little areas, like not returning every object in the /schemas endpoint. And in big areas, like not being able to detect deletes for custom objects.

All that said, with a sync in place, having your HubSpot data in your local datastore provides a ton of benefits. You abstract API complexity away from the rest of your app. Remote calls become local ones, reducing cognitive overhead around guarding for failures or availability issues. You don't have to think about rate limits. Reads are lightning fast. And when your data is in a database, it's far more queryable–you can perform complex queries and analysis easily.