An update on our recent service disruption

By Pedro Canahuati

On April 27th, 1Password experienced a brief service outage owing to an internal code issue – it was not a security incident, and customer data was not affected in any way.

1Password is designed to protect your information at all costs, with local copies of vault data always available on your devices – even without a connection to the 1Password service or the internet itself. As a result, your passwords and other vault items remain safe and sound.

We’re sorry for any disruption this outage may have caused and deeply appreciate your patience during our investigation. Service has been fully restored, and we can now share further details about what happened and how we’re working to avoid similar situations in the future.

What happened?

On April 27th, our scheduled maintenance included an upgrade to our database aimed at improving performance.

Although the upgrade itself was successful, the improvement had unintended consequences. It revealed that certain queries weren’t optimized for the new performance characteristics of the database, leading to unexpected behavior that ultimately destabilized the system.

This behavior only occurred under specific circumstances that didn’t emerge in our test environments.

As a result, we saw a temporary service disruption that impacted syncing data across devices, access to administrative interfaces, new account signups, and performance of the 1Password Connect server.

Our team quickly identified the underlying issue and deployed a fix. After additional testing, we can confirm that all systems are back to normal.

What did we do?

Last year, we identified some performance improvements we could gain from upgrading our databases to the latest MySQL version.

We spent months running tests to ensure that all our services, code, and infrastructure could transition smoothly to the newer MySQL version. When the day of the upgrade arrived, we had a solid plan and executed the transition during a scheduled maintenance window.

On the morning of April 27th, as we entered a period of heavier traffic, we noticed a large number of database connections remaining open, with queries not completing efficiently. After some debugging, we theorized that inefficient SQL queries were causing lock contention, which in turn drove up the number of open connections. Eventually, we ran up against connection limits.
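To illustrate why slower queries translate so directly into more open connections, here is a minimal sketch based on Little's law (average in-flight queries = arrival rate × mean query time). The function names and every number in it are hypothetical, chosen only for illustration; they are not measurements from this incident.

```python
def concurrent_connections(requests_per_second: float, mean_query_seconds: float) -> float:
    """Little's law: average in-flight queries = arrival rate * mean time per query."""
    return requests_per_second * mean_query_seconds


def exceeds_limit(requests_per_second: float,
                  mean_query_seconds: float,
                  max_connections: int) -> bool:
    """True when steady-state demand for connections is above the configured limit."""
    return concurrent_connections(requests_per_second, mean_query_seconds) > max_connections


# Hypothetical figures, for illustration only.
RATE = 800.0           # requests per second during peak traffic
FAST_QUERY = 0.125     # seconds per query when nothing is waiting on a lock
SLOW_QUERY = 2.0       # seconds per query under lock contention
LIMIT = 1000           # assumed database connection limit

healthy = concurrent_connections(RATE, FAST_QUERY)
contended = concurrent_connections(RATE, SLOW_QUERY)
print(f"healthy: {healthy:.0f} connections, contended: {contended:.0f} connections")
print("limit exceeded under contention:", exceeds_limit(RATE, SLOW_QUERY, LIMIT))
```

In this hypothetical, the same traffic that needs about 100 connections at 125 ms per query needs 1,600 once lock contention stretches queries to 2 seconds, well past an assumed limit of 1,000.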

We immediately scaled down the service that keeps data in sync between devices to alleviate some of the load and allow our services to recover.

Working from this hypothesis, we optimized the queries, built new versions of our services, and deployed them to our production environment. We then scaled our database instances above what we had initially provisioned to account for the increased load we would see as the sync service caught up.
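The reasoning behind provisioning extra capacity for a catch-up period can be sketched with simple arithmetic: a backlog only drains at the rate by which capacity exceeds ongoing demand. The function name and the figures below are hypothetical and are not values from the incident.

```python
import math


def drain_time_seconds(backlog_requests: float,
                       capacity_rps: float,
                       arrival_rps: float) -> float:
    """Time to clear a backlog while new requests keep arriving.

    Only the spare capacity (capacity minus ongoing arrivals) works on the
    backlog. With no spare capacity the backlog never drains.
    """
    spare = capacity_rps - arrival_rps
    if spare <= 0:
        return math.inf
    return backlog_requests / spare


# Hypothetical figures, for illustration only.
BACKLOG = 90_000       # sync requests queued while the service was scaled down
ARRIVALS = 800.0       # ongoing requests per second

for capacity in (800.0, 1000.0, 1400.0):
    seconds = drain_time_seconds(BACKLOG, capacity, ARRIVALS)
    print(f"capacity {capacity:.0f} rps -> backlog clears in {seconds} s")
```

With capacity equal to ongoing demand, the backlog in this sketch never clears; raising capacity from 1,000 to 1,400 requests per second cuts the catch-up time from 450 seconds to 150.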

We closely monitored service health and stability over the next 24 hours as we prepared for the next day’s peak load. By April 28th, everything was still running smoothly. Although we saw an initial increase in connections as sync requests resumed, things quickly stabilized and we were able to confirm that the fixes were working as expected.

What happens next?

We care deeply about our customers, their data, and their experience, so we take any service disruption like this very seriously.

To avoid similar incidents in the future, our immediate next step is to analyze the data we collected in more depth so that we fully understand the underlying causes of this incident. That analysis will inform refinements to our testing procedures and capacity planning so that we properly account for these scenarios.

We take the integrity of your data and the stability of our systems very seriously and will continue to work hard every day to earn the trust you’ve placed in us.

Pedro Canahuati - Chief Technology Officer
