Evolving Intercom’s database infrastructure: Lessons and progress

A few months ago, we launched an overhaul of Intercom’s database architecture with Vitess and PlanetScale. Already, we’re seeing transformative results: zero-downtime maintenance, 90%+ performance improvement for our most demanding queries, and a more resilient infrastructure that scales in days, not months.

In our previous post, we shared how we’re overhauling Intercom’s database architecture to scale better, reduce complexity, and improve reliability. We are moving away from Amazon Aurora and our custom sharding solution, and adopting Vitess, managed by PlanetScale. We outlined why we made this shift, the challenges we faced in our legacy setup, and the benefits we expected.

Since then, we’ve migrated critical databases, optimized query performance, and tackled some of our toughest infrastructure bottlenecks. Most significantly, we’ve adopted PlanetScale Metal, a new storage architecture that has improved performance, reduced costs, and simplified our operations.

This post is a status update on where we are today – what’s working, what we’ve learned, and what’s next.

Recap: Why Vitess and PlanetScale?

Our decision to adopt Vitess and PlanetScale was driven by five key goals:

  1. Improving availability
  2. Eliminating downtime during maintenance
  3. Reducing engineering complexity
  4. Streamlining table migrations
  5. Achieving straightforward scaling

To achieve these goals, we needed a database that could scale seamlessly while staying operationally simple. Vitess does this by providing built-in sharding, connection pooling, and zero-downtime migrations and upgrades, all while remaining MySQL-compatible.

Running Vitess ourselves would have required a massive investment in building operational expertise. PlanetScale, with its managed platform and deep Vitess expertise, offered an exceptional developer experience and handled the heavy lifting of infrastructure management.

Progress so far

Over the past few months, we’ve successfully migrated several of our most critical databases, including those powering Intercom’s AI infrastructure, Contacts, and our Inbox, one of the most latency-sensitive parts of our system.

Each migration has validated our decision and reinforced the benefits of the new architecture. These migrations apply to our US-hosted region. Our EU- and AU-hosted regions will follow soon, but because they are not currently affected by scaling-related database performance issues, we are prioritizing the US migration.

Key successes and optimizations

Despite some initial challenges outlined in our original post, we’ve seen numerous successes and optimizations:

Sharding in days, not months

Previously, sharding even a single table took months and required careful coordination, manual data migrations, and application-layer changes. With PlanetScale, we’ve sharded multiple databases in a matter of days, freeing up resources and dramatically improving query performance.

Massive query speedups with materialized views

Some of our most expensive queries on Aurora were rewritten using Vitess materialized views, leading to 90%+ performance improvements. These improvements especially benefit our largest customers who generate the most data.

Connection management is no longer a headache

Vitess’s VTGate component has simplified connection management. It eliminates the 16,000-connection limit per MySQL host imposed by Aurora and removes the need for ProxySQL, our connection management middle layer.

This has reduced complexity and potential points of failure. For example, the database that powers the Inbox would routinely require 135,000 active connections – we can now do this without additional infrastructure.
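As a rough illustration of the numbers above, the arithmetic below works out how many hosts would be needed if each of those 135,000 connections went directly to a MySQL host capped at 16,000 connections. This is only a back-of-the-envelope sketch in Python: the function name is made up for the example, and it says nothing about how the ProxySQL layer was actually sized.

```python
import math

# Figures quoted in this post.
AURORA_MAX_CONNECTIONS_PER_HOST = 16_000
INBOX_ACTIVE_CONNECTIONS = 135_000


def min_hosts_for_connections(total_connections: int, per_host_limit: int) -> int:
    """Smallest number of hosts whose combined limit covers total_connections.

    Hypothetical helper for illustration only.
    """
    if total_connections < 0 or per_host_limit <= 0:
        raise ValueError("total_connections must be >= 0 and per_host_limit > 0")
    return math.ceil(total_connections / per_host_limit)


hosts = min_hosts_for_connections(
    INBOX_ACTIVE_CONNECTIONS, AURORA_MAX_CONNECTIONS_PER_HOST
)
print(hosts)  # 9
```

In other words, without a pooling layer in front of the database, that connection count alone would exceed what eight hosts at the per-host limit could accept.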

Zero-downtime maintenance

Vitess’s failover mechanisms have enabled us to perform maintenance without any customer downtime. In the past, almost all of Intercom’s scheduled maintenance windows were for critical database maintenance on Aurora, so being able to make these changes without disrupting our customers is a huge win.

We’ve been upgrading databases every couple of weeks, and no one has even noticed. That’s exactly how it should be.

PlanetScale Metal: A game changer

Early in our migration, while moving a database critical to the Inbox, we ran into a major performance bottleneck: disk I/O saturation on database replicas. This led to incidents in June and July 2024 in which Inbox performance was degraded or slow.

We originally used EBS-backed storage (gp3 volumes) for our PlanetScale databases. But at peak load, replicas maxed out their IOPS allocation, causing increased CPU wait times and degraded performance. Scaling out was an option, but provisioning new replicas and catching them up with replication took hours. During this time, core operations like the Inbox and Load Balanced Assignment suffered.
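To see why hitting an IOPS ceiling hurts so quickly, here is a minimal Python sketch of a volume with a fixed IOPS allocation. Requests above the allocation queue up and carry over into the next second. The demand figures and the 3,000 IOPS allocation are hypothetical numbers chosen for illustration, not measurements from our databases, and the model ignores caching and throughput limits.

```python
def io_backlog(demand_per_second, provisioned_iops):
    """Return the queued I/O requests at the end of each second.

    Simplified model: each second the volume serves at most
    `provisioned_iops` requests; anything left over waits in a queue.
    """
    backlog = 0
    history = []
    for demand in demand_per_second:
        backlog = max(0, backlog + demand - provisioned_iops)
        history.append(backlog)
    return history


# Hypothetical peak: demand briefly exceeds a 3,000 IOPS allocation.
peak = [2000, 4000, 5000, 3000, 1000]
print(io_backlog(peak, provisioned_iops=3000))
# [0, 1000, 3000, 3000, 1000]
```

Even after demand drops back to the allocation (the fourth second), the queue does not shrink; it only drains once demand falls below the limit. That lingering queue is what shows up as I/O wait.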

We learned the hard way: scaling under pressure is slow, expensive, and disruptive. Our immediate fix was to overscale and upgrade to high-IOPS EBS io2 volumes.

This experience made it clear that our storage layer was an expensive bottleneck. We needed higher throughput and lower latency, but without the operational headaches of constantly tuning IOPS provisioning.

We needed a better way. That’s where PlanetScale Metal changed the game. PlanetScale Metal uses locally attached NVMe drives and removes the need for the slower network-attached storage (EBS).

This has translated into immediate and substantial benefits for Intercom.

Dramatic performance improvements

We saw immediate and significant improvements in tail latency. This performance boost has benefited all customers, with queries running faster and more consistently, even during peak load.

This heat map of conversation load times in the Intercom Inbox shows the moment Metal went live: tail latency dropped significantly.

Operational stability

Simply put, every database we’ve migrated to PlanetScale Metal has run smoothly. We’ve had no availability issues caused by the migrated databases.

Significant cost reduction

By switching to Metal, we’ve achieved a 60%+ reduction in cost compared to our previous EBS io2 volumes. Metal provides even greater IO capacity and improved latency at a fraction of the cost.

Here we can see the moment Metal went live reflected in the hourly cost of the database.

Smooth database migration

The database that powers Contacts is one of our highest-throughput databases. When we had to upgrade it from Aurora 1 to Aurora 2 in 2022, that single database took six months of engineering effort. With PlanetScale Metal supplying the throughput, and the lessons already learned, we were able to test and then migrate it safely in a couple of weeks, with no impact to production.

Ensuring durability on PlanetScale Metal

PlanetScale Metal databases offer exceptional durability through semi-synchronous, row-based MySQL replication to a minimum of three replicas spread across three availability zones, ensuring every write is securely persisted before acknowledgment.
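The idea behind semi-synchronous replication can be sketched in a few lines of Python: the primary only acknowledges a write to the client once enough replicas have confirmed they received it. This is a toy model under stated assumptions, not PlanetScale’s or MySQL’s implementation. The `Replica` class, the zone names, and the `required_acks` parameter are all invented for the example, and real MySQL also applies a timeout, which is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a replica in one availability zone."""
    name: str
    zone: str
    reachable: bool = True
    relay_log: list = field(default_factory=list)

    def receive(self, event: str) -> bool:
        """Store the event and confirm receipt, if the replica is reachable."""
        if not self.reachable:
            return False
        self.relay_log.append(event)
        return True


def semi_sync_commit(event: str, replicas: list, required_acks: int = 1) -> bool:
    """Acknowledge the write only if enough replicas confirmed receipt."""
    acks = sum(1 for replica in replicas if replica.receive(event))
    return acks >= required_acks


replicas = [
    Replica("replica-a", "zone-1"),
    Replica("replica-b", "zone-2", reachable=False),
    Replica("replica-c", "zone-3", reachable=False),
]
print(semi_sync_commit("INSERT row 1", replicas))  # True: one replica has the write

replicas[0].reachable = False
print(semi_sync_commit("INSERT row 2", replicas))  # False: no replica confirmed
```

Because an acknowledged write always exists on at least one other machine, losing the primary’s local NVMe drive does not lose the write.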

All databases that have been migrated to PlanetScale now run on Metal. All databases we’re actively moving across will only ever use Metal.

The road ahead

While we’ve made huge strides, the biggest challenge is still ahead: migrating our two largest legacy databases, which power nearly every part of Intercom.

These databases still run on our custom-sharded Aurora setup, where we’ve had an unacceptable increase in availability issues in recent months. A significant driver has been that we are stretching the limits of what Aurora can deliver for us. This migration to PlanetScale will move hundreds of terabytes of customer data.

Based on everything we’ve learned, and extensive testing already performed, we’re confident this transition will be smooth and result in a step-change in performance and stability.

We’ll share more updates as we move forward.
