Increased Load Times/Errors When Logging In

Incident Report for Civitas Learning

Postmortem

What Happened

At approximately 7:25am Pacific time on Nov 2, 2020, the College Scheduler application experienced a spike in traffic correlated with the opening of Spring 2021 Registration. The application handled the traffic well until a sharp increase in volume within a 20-30 second span caused timeouts and, subsequently, HTTP errors (specifically 502 and 504 errors) to be displayed to student and staff users. College Scheduler became slow to respond and, soon after, completely unavailable. Unfortunately, the outage lasted approximately 6.5 hours as we worked to bring the system back online.

Why Did It Fail and Why Did It Last So Long

The service responsible for authenticating users into the College Scheduler application became unresponsive during this event. Newly added service instances (identical virtual machines) were immediately deemed unhealthy by automated health checks, blocking our efforts to remedy the problem. Meanwhile, whereas a spike in login activity typically subsides as valid login “tokens” are issued and then reused in subsequent requests, this event resulted in a sustained level of invalid login activity (expired tokens) that compounded the number of requests our authentication servers were handling. No amount of added resources could cope with the volume of queued requests that needed to be processed.
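As a rough illustration of why a sustained stream of failed logins can outrun added capacity, the following Python sketch models a single request queue in which a share of processed requests fail and immediately return as retries. It is a toy model, not a description of the College Scheduler authentication service: the function name `simulate_backlog`, the parameter `retry_percent`, and every number used are hypothetical.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, retry_percent, ticks):
    """Return the queue length after each tick of a toy retry model.

    Hypothetical model: each tick, new requests arrive, the servers
    process up to `capacity_per_tick` queued requests, and
    `retry_percent` percent of the processed requests fail and are
    re-queued as retries.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        processed = min(backlog, capacity_per_tick)
        backlog -= processed
        # Failed requests (for example, expired tokens) come back as retries.
        backlog += processed * retry_percent // 100
        history.append(backlog)
    return history


# Same arrival rate and capacity; only the share of failing requests differs.
healthy = simulate_backlog(100, 150, 0, 5)
storm = simulate_backlog(100, 150, 60, 5)
print(healthy)
print(storm)
```

In this model the queue drains every tick when requests succeed, but grows without bound once retries consume more capacity than is left over for new arrivals, even though nominal capacity exceeds the arrival rate.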

What Existing Process or Safeguards Failed

College Scheduler is heavily load-tested using real-world data and an exact replica of production. The system is designed for burst loads of traffic, which we encounter every day during registration periods and throughout the year (e.g., during freshman orientation events). An example of such a burst is shown below, where between 6:58am and 7:00am the application experienced an almost immediate 400% increase in traffic.

With nearly 300 customers, it would be very challenging to track anticipated registration and enrollment dates for each institution and schedule around these bursts. As such, our approach is to replicate these burst registration conditions in a controlled environment using our load-test process, and to scale for these events automatically. These tests have led to dozens of learnings and resolved many issues previously encountered during normal use of College Scheduler.

Still, our load tests have historically been modeled on the assumption that login requests succeed and that subsequent activity is properly authenticated. The Nov 2 event, however, resulted in a large number of failed login events in very quick succession. This showed that the application, in this instance, was susceptible to prolonged wait times, which resulted in rapid thread exhaustion. The problem was compounded throughout the day as inbound traffic to College Scheduler remained heavy.
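The link between prolonged wait times and thread exhaustion can be sketched with Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The snippet below is only an illustration under assumed figures: the pool size, request rate, and latencies are hypothetical and are not measurements from the Nov 2 event.

```python
def threads_needed(requests_per_second, latency_ms):
    """Estimate concurrent worker threads via Little's law.

    in-flight requests = arrival rate x time in system, rounded up.
    Integer arithmetic is used so the result is exact.
    """
    return (requests_per_second * latency_ms + 999) // 1000


POOL_SIZE = 200  # hypothetical worker-thread pool size

# Same request rate, but each request holds its thread for longer.
fast = threads_needed(500, 200)    # hypothetical 200 ms per request
slow = threads_needed(500, 5000)   # hypothetical 5 s per request

print(fast, fast <= POOL_SIZE)
print(slow, slow <= POOL_SIZE)
```

With the request rate held constant, a rise in per-request wait time alone is enough to push the required thread count past a fixed pool, which is the failure pattern described above.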

Changes We Made to Fix

Once it was clear that the scaling procedures in place were not effective, we put the application into maintenance mode to try to stem the tide of incoming requests to the service. The maintenance page was in place for 14 minutes, which allowed enough time for previously allocated service instances to become healthy. However, after the maintenance page was lifted, the application became unstable again within 18 minutes. At this time, a change was made to expand resources to approximately 7 times our highest anticipated level of capacity. After this service allocation, the system stabilized and remained healthy while traffic levels subsided.

What We Are Doing to Correct the Root Cause

The College Scheduler system is being enhanced to more appropriately scale to the extreme levels of student traffic we see in registration periods. Specifically, our system is being tuned to ensure a more rapid scaling speed combined with more proactive pre-allocation of resources leading up to registration season. We have already replicated the Nov 2 issue in a controlled environment and are working through our autoscaling improvements now. In the interim, our infrastructure has remained purposefully over-allocated to greatly exceed the capacity needed to respond to these spikes as registration season continues.
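To make the trade-off between scaling speed and pre-allocation concrete, the sketch below counts how many scaling steps it takes to reach a target instance count under three hypothetical policies. The function `steps_to_reach` and all instance counts are assumptions chosen for illustration and do not reflect the actual autoscaling configuration.

```python
def steps_to_reach(target_instances, start_instances, added_per_step):
    """Number of scaling steps before capacity reaches the target."""
    if start_instances >= target_instances:
        return 0
    missing = target_instances - start_instances
    return -(-missing // added_per_step)  # ceiling division


TARGET = 70  # hypothetical instance count needed at peak

reactive = steps_to_reach(TARGET, 10, 5)        # small fleet, slow scaling
pre_allocated = steps_to_reach(TARGET, 60, 5)   # larger fleet started in advance
faster_scaling = steps_to_reach(TARGET, 10, 20) # small fleet, bigger steps

print(reactive, pre_allocated, faster_scaling)
```

Both levers shorten the window in which demand exceeds capacity: pre-allocation reduces the gap that must be closed, while faster scaling closes the same gap in fewer steps.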

We deeply regret the downtime that occurred on Nov 2 and have been working diligently to understand both the issue itself and why our load testing and production systems were not tuned for this event. Rest assured we are striving to be better and to make sure that you can trust Civitas Learning with your next registration period.

Posted Nov 07, 2020 - 02:17 CST

Resolved

Systems are fully operational. We continue to analyze the event as well as monitor for any further degradation of performance or increased error rates.

Posted Nov 02, 2020 - 17:37 CST

Update

We continue to see degraded system response times and intermittent 502 and 504 responses. The system is scaling to handle the additional traffic in an attempt to reduce response times. Latency and error rates remain above acceptable thresholds and we will continue to modify the infrastructure until these numbers improve.

Posted Nov 02, 2020 - 15:41 CST

Update

We are continuing to monitor for any further issues.

Posted Nov 02, 2020 - 15:26 CST

Update

We are continuing to monitor for any further issues.

Posted Nov 02, 2020 - 15:25 CST

Update

The application remains unstable and we are continuing to monitor for any further issues.

Posted Nov 02, 2020 - 15:25 CST

Update

The application is back up and all sites are loading. We are continuing to monitor for issues as more users come back into the application. We have implemented more detailed logging to further enable our diagnosis of the root cause for the earlier outage. A full post mortem of the incident will come later today or tomorrow after full analysis.

Posted Nov 02, 2020 - 14:33 CST

Monitoring

A fix is in place for the current issue and we are actively bringing the application out of maintenance.

Posted Nov 02, 2020 - 14:18 CST

Update

We are continuing to investigate this issue.

Posted Nov 02, 2020 - 12:12 CST

Update

We continue to diagnose and make changes to resolve this issue. The system is recovering, but we are still seeing a high volume of timeouts. We will update with more information as soon as we have it.

Posted Nov 02, 2020 - 11:08 CST

Investigating

We are experiencing an increase in load times coinciding with gateway time-out errors on the College Scheduler system. The team is aware of and actively working to resolve the issue.

Posted Nov 02, 2020 - 10:40 CST

This incident affected: College Scheduler.