What Happened
At approximately 7:25am Pacific time on Nov 2, 2020, the College Scheduler application experienced a spike in traffic correlated with the opening of Spring 2021 Registration. The application handled the traffic well until a sharp increase in volume within a 20-30 second span caused timeouts and, subsequently, HTTP errors (specifically 502 and 504) to be displayed to student and staff users. College Scheduler became slow to respond and, soon after, completely unavailable. Unfortunately, the outage lasted approximately 6.5 hours as we worked to bring the system back online.
Why Did It Fail and Why Did It Last So Long
The service responsible for authenticating users into the College Scheduler application became unresponsive during this event. Newly added service instances (identical virtual machines) were immediately deemed unhealthy by automated health checks, blocking our efforts to remedy the problem. Meanwhile, a spike in login activity typically subsides as valid login “tokens” are issued and then reused in subsequent requests. This event instead produced a sustained level of invalid login activity (requests carrying expired tokens) that compounded the number of requests our authentication servers were handling. No amount of added resources could cope with the volume of queued requests that needed to be processed.
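To illustrate why token reuse matters, the following is a minimal sketch of how login load grows when fewer requests carry a valid token. The function name `auth_load`, the request rate, and the token fractions are all hypothetical and chosen for illustration; they are not measurements from the Nov 2 event.

```python
def auth_load(total_requests_per_s: float, token_valid_fraction: float) -> float:
    """Requests per second that reach the authentication service.

    Hypothetical model: a request carrying a valid token is served without
    a new login; every other request triggers a login attempt.
    """
    if not 0.0 <= token_valid_fraction <= 1.0:
        raise ValueError("token_valid_fraction must be between 0 and 1")
    return total_requests_per_s * (1.0 - token_valid_fraction)


# Illustrative numbers only, not figures from the incident.
normal = auth_load(total_requests_per_s=5000, token_valid_fraction=0.95)
degraded = auth_load(total_requests_per_s=5000, token_valid_fraction=0.20)
print(f"normal: {normal:.0f} logins/s, degraded: {degraded:.0f} logins/s")
```

Under these assumed numbers, the same inbound traffic places a far larger load on the authentication service once most tokens are expired, which is consistent with the sustained pressure described above.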
What Existing Process or Safeguards Failed
College Scheduler is heavily load-tested using real-world data and an exact replica of production. The system is designed for burst loads of traffic, which we encounter every day during registration periods and throughout the year (e.g. during freshman orientation events). An example of such a burst is shown below, where between 6:58am and 7:00am the application experienced an almost immediate 400% increase in traffic.
With nearly 300 customers, it would be very challenging to track anticipated registration / enrollment dates for each institution and schedule around these bursts. As such, our approach is to replicate these burst registration conditions using our load test process in a controlled environment and to scale for these events automatically. These tests have taught us dozens of lessons and resolved many issues previously encountered during normal use of College Scheduler.
Still, our load tests have historically been modeled on the assumption that login requests succeed and that subsequent activity is properly authenticated. The Nov 2 event, however, produced a large number of failed login events in very quick succession. This showed that the application was, in this instance, susceptible to prolonged wait times, which resulted in rapid thread exhaustion. The problem was compounded throughout the day as inbound traffic to College Scheduler remained heavy.
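The link between prolonged wait times and thread exhaustion can be sketched with Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the mean time each request takes. The sketch below assumes one thread per in-flight request; the pool size, arrival rate, and latencies are hypothetical values chosen for illustration, not figures from our systems.

```python
def threads_in_use(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: average requests in flight = arrival rate * mean latency."""
    return arrival_rate_per_s * mean_latency_s


def pool_exhausted(pool_size: int, arrival_rate_per_s: float, mean_latency_s: float) -> bool:
    """True when the average demand for threads exceeds the pool size."""
    return threads_in_use(arrival_rate_per_s, mean_latency_s) > pool_size


POOL_SIZE = 200  # hypothetical thread pool size

# Same arrival rate, but each request waits far longer in the second case.
healthy = pool_exhausted(POOL_SIZE, arrival_rate_per_s=400, mean_latency_s=0.2)
slow = pool_exhausted(POOL_SIZE, arrival_rate_per_s=400, mean_latency_s=5.0)
print(f"healthy exhausted: {healthy}, slow exhausted: {slow}")
```

In this model, traffic does not need to grow for the pool to run out: a large enough rise in per-request wait time is sufficient, and adding threads only helps if the added capacity outpaces the backlog.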
Changes We Made to Fix
Once it was clear that the scaling procedures in place were not effective, we put the application into maintenance mode to try to stem the tide of incoming requests to the service. The maintenance page was in place for 14 minutes, which allowed enough time for previously allocated service instances to become healthy. However, after the maintenance page was lifted, the application became unstable again within 18 minutes. At this time, a change was made to expand resources to approximately 7 times our highest anticipated level of capacity. After this service allocation, the system stabilized and remained healthy while traffic levels subsided.
What We Are Doing to Correct the Root Cause
The College Scheduler system is being enhanced to more appropriately scale to the extreme levels of student traffic we see in registration periods. Specifically, our system is being tuned to ensure a more rapid scaling speed combined with more proactive pre-allocation of resources leading up to registration season. We have already replicated the Nov 2 issue in a controlled environment and are working through our autoscaling improvements now. In the interim, our infrastructure has remained purposefully over-allocated to greatly exceed the capacity needed to respond to these spikes as registration season continues.
We deeply regret the downtime that occurred on Nov 2. We have been working diligently to understand the issue thoroughly and to determine why our load testing and production systems were not tuned for this event. Rest assured we are striving to be better and to make sure that you can trust Civitas Learning with your next registration period.