Incident date: 15 Feb 2021, The BigInt Disaster

Summary

Note: All times are in Asia/Kuala_Lumpur (UTC+08:00)

This document is a technical explanation of the incident on the 15th of February 2021. It serves as an avenue for the community to learn and understand the root cause of the incident. 

On the 15th of February, users were unable to access the merchant dashboard and make payments. Services were intermittently unavailable for approximately 5 hours. The issues were caused by slow database queries whose impact cascaded to other services. We changed the query structure to improve performance. Affected services returned to normal by 3:47 PM on the 15th of February 2021.

We recently upgraded certain data types in our database to optimise for larger datasets. The affected query involved one of the newly migrated data types.
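
The upgrade was of the kind sketched below: widening integer identifier columns to BigInt so they can hold far larger values. This is a minimal, hypothetical Rails migration; the actual tables and columns involved are not shown here.

```ruby
# Hypothetical migration illustrating the kind of data type upgrade;
# the real tables and columns differ.
class WidenCollectionIdToBigint < ActiveRecord::Migration[6.0]
  def up
    # bigint raises the ceiling from about 2.1 billion to about 9.2 quintillion.
    change_column :payments, :collection_id, :bigint
  end

  def down
    change_column :payments, :collection_id, :integer
  end
end
```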


Timeline

The incident occurred between 05:30 AM and 3:47 PM on the 15th of February 2021.

🕔 05:33 AM

Start of the incident. The first symptoms were email notifications from our service provider about increased response times. The majority of services were still up, but with degraded performance.

🕢 07:30 AM

The on-call engineer started the triage procedure. All engineering representatives were alerted to the notifications. As a temporary mitigation while the root cause was being investigated, we performed hard restarts and provisioned new instances to clear up memory.

🕥 10:46 AM

The root cause was identified and confirmed: slow queries degraded response times, and the impact propagated throughout the services. After extensive QA, the first hotfix was deployed to the Billplz main service, which went back online. At this point, users were able to resume making payments.

The investigation continued on the dashboard service.

🕧 12:49 PM

A similar hotfix was deployed to the dashboard apps, and the service came back up. We continued to monitor performance.

🕞 3:47 PM

We confirmed all services were back to normal. Incident closed.


Lessons learnt

✅ What went well

  1. We were alerted to the outage by automated bots before the web server started to crash.
  2. Database records were not corrupted by the migration to BigInt.
  3. CPU and memory usage were normal, and there were no other issues that complicated the resolution process.

⚠️ What went wrong

  1. PostgreSQL behaved differently when sorting on a joined table after the column type was changed to BigInt: it ignored the available index.
  2. PostgreSQL did not use the proper indexes until VACUUM and ANALYZE were performed, resulting in a nested-loop plan that made the SQL execution take much longer (see the sketch after this list).
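
The effect is easiest to see in the query plans. Below is a minimal sketch of the before-and-after check, assuming hypothetical Payment and Collection models; the actual schema, associations, and query differ.

```ruby
# Hypothetical models for illustration only; the real Billplz schema differs.
class Collection < ApplicationRecord
  has_many :payments
end

class Payment < ApplicationRecord
  belongs_to :collection
end

# A sorted query across a join, similar in shape to the one that regressed.
relation = Payment.joins(:collection)
                  .where(collections: { state: "active" })
                  .order(created_at: :desc)
                  .limit(20)

# Before statistics were refreshed, the plan showed a nested loop that
# ignored the existing index on the widened column.
puts relation.explain

# Refreshing statistics lets the planner pick the index again.
# (VACUUM cannot run inside a transaction, so run it outside one.)
ActiveRecord::Base.connection.execute("VACUUM ANALYZE payments")
puts relation.explain
```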

🙏 Where we got lucky

  1. Our database servers did not crash, and no data was lost during the period.
  2. There were not many ActiveRecord queries to fix, as the problematic query was only defined in a few places in the source code.


Action items

💪 Technical improvements

  1. Ensure that joins with large foreign tables do not sort across the join.
  2. On a case-by-case basis, performing separate queries instead of a single join may perform better (see the sketch after this list).
  3. Using an ActiveRecord has_many ... :through association results in tables being queried with joins. Such relations need to be used with caution to avoid case (1).
  4. Run VACUUM and ANALYZE every time structural changes are made to a table so that PostgreSQL keeps using the correct indexes.
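
To illustrate items (2) and (3), a sorted join through a has_many ... :through style association can often be split into two simpler queries so the sort happens on a single table where its index can be used. This is a minimal sketch under the same hypothetical models as above, not the exact fix that was deployed.

```ruby
# Joined version: sorting across the join is where the regression appeared.
slow = Payment.joins(:collection)
              .where(collections: { state: "active" })
              .order(created_at: :desc)
              .limit(20)

# Split version: resolve the foreign keys first, then sort on a single table.
collection_ids = Collection.where(state: "active").pluck(:id)
fast = Payment.where(collection_id: collection_ids)
              .order(created_at: :desc)
              .limit(20)
```

Whether the split version is actually faster depends on how many ids the first query returns, so it is worth comparing both plans with explain before settling on one.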