Liaison

Root Cause Analysis – January 2024 Email Broadcasts

Purpose

This RCA outlines the root cause, impact, and mitigation steps for the email delivery delays that affected customers from January 28, 2024 through February 1, 2024.

Affected Time Period

Clients experienced email delays from the evening of 1/28/2024 through 2/1/2024.

Impacted Audience

All TargetX Email clients.

Summary of the Issue

Continually increasing traffic on the TargetX Email service pushed the allocated hardware to its limits. The issue began on the evening of Sunday, January 28, and the lag worsened into Monday morning as email broadcast traffic increased. By that point, clients were experiencing a delay of 3 to 4 hours between scheduled and successful send times. This occurred because the database server had reached its capacity limit for handling concurrent broadcasts, recipient activity uploads, and other services essential to Email product functionality.

The Development team then increased the database server’s hardware size and upgraded the underlying database engine. Following this upgrade, services appeared to run normally for several hours, after which an additional issue was encountered: the service responsible for querying upcoming broadcasts began to experience errors and ultimately stopped. To keep broadcasts processing, our Development team implemented round-the-clock monitoring to relaunch the service whenever it fell into an error state.

During this time, a limited number of customers were impacted by broadcasts sending more than once. These repeated sends were caused by manual restarts occurring prior to initial broadcasts completing.
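As an illustration only, the following minimal sketch shows one common way to keep a restarted worker from re-sending a broadcast that is already in progress: each worker must atomically claim the broadcast before sending it. This is not a description of the TargetX implementation. The `broadcasts` table, its `status` values, and the `claim_broadcast` helper are all hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3


def claim_broadcast(conn: sqlite3.Connection, broadcast_id: int) -> bool:
    """Atomically move a broadcast from 'scheduled' to 'sending'.

    Returns True only for the caller whose UPDATE actually changed the row,
    so a second (restarted) worker sees False and skips the send.
    """
    cur = conn.execute(
        "UPDATE broadcasts SET status = 'sending' "
        "WHERE id = ? AND status = 'scheduled'",
        (broadcast_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema, in memory, for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE broadcasts (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO broadcasts (id, status) VALUES (1, 'scheduled')")
conn.commit()

first_attempt = claim_broadcast(conn, 1)   # original worker claims the broadcast
second_attempt = claim_broadcast(conn, 1)  # restarted worker finds it already claimed
```

With a guard of this kind, a manual restart that happens before the initial send completes would find the broadcast already claimed and would not send it again.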

This issue was determined to be caused by a critical query that had become too inefficient to complete within its allotted time limit. Additionally, mechanisms put in place in the database to speed up the queries that determine broadcast processing order were no longer being applied. Once this was uncovered, the queries in question were updated to force the use of the correct database mechanisms, and the system was restored to its expected functionality.
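The article does not name the database mechanisms involved. Assuming, purely for illustration, that they are indexes, the sketch below shows what forcing a query to use a specific index can look like. It uses SQLite's `INDEXED BY` clause because it runs without a server; other database engines offer different hint syntax. The table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table of scheduled broadcasts.
conn.execute(
    "CREATE TABLE broadcasts ("
    "id INTEGER PRIMARY KEY, status TEXT, scheduled_at TEXT)"
)
# Hypothetical index meant to speed up the "what is due next?" query.
conn.execute(
    "CREATE INDEX idx_broadcasts_due ON broadcasts (status, scheduled_at)"
)

# INDEXED BY makes SQLite use the named index, or fail to prepare the
# statement if it cannot, rather than silently choosing another plan.
query = (
    "SELECT id FROM broadcasts INDEXED BY idx_broadcasts_due "
    "WHERE status = 'scheduled' AND scheduled_at <= ? "
    "ORDER BY scheduled_at"
)

# Inspect the query plan to confirm the index is actually used.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("2024-02-01T00:00:00",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(plan_text)
```

Checking the query plan after an engine upgrade is one way to notice that an expected index is no longer being applied.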

Root Cause

The incrementally increasing volume of email broadcasts relative to the currently allocated hardware capacity, along with the subsequent steps taken to improve that capacity, contributed to the delay in sending emails.

Issue Resolution

Our Development team increased the database server's capacity for sending broadcast email. Changes were also made on the newly created database infrastructure to enforce the use of the database mechanisms essential for query efficiency, so that processes complete within an acceptable period. Once all automated processes were observed to be running, no further manual intervention was needed, which removed the cause of the duplicate sends.

Root Cause Mitigation

We have implemented more detailed process monitoring for the email services in question. Our team will also regularly review the services that interact with the database to increase reliability and efficiency.

 
