Service Interruption
Incident Report for FLG
Postmortem
What Happened?

At approximately 12:40 BST on Apr 07, 2016, our automated monitoring alerted us to an outage. We immediately started investigating and identified that our primary database had a backlog of connections.

Shortly after, at about 12:45 BST, we identified that the issue was caused by the logs 'logical volume' (disk space) being full.
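
For illustration, a simple per-volume check like the sketch below is the kind of monitoring that would have caught the volume filling up sooner. This is a minimal Python sketch; the mount point and threshold are assumptions for the example, not our actual configuration.

    import shutil

    # Hypothetical mount point for the logs logical volume, and an
    # illustrative alert threshold; our real configuration differs.
    LOG_VOLUME = "/var/log"
    ALERT_THRESHOLD = 0.80

    def volume_over_threshold(path, threshold):
        """Return True if the volume holding `path` is over the threshold."""
        usage = shutil.disk_usage(path)
        fraction_used = usage.used / usage.total
        if fraction_used >= threshold:
            print("ALERT: %s is %.0f%% full (%d MiB free)"
                  % (path, fraction_used * 100, usage.free // 2**20))
            return True
        return False

    volume_over_threshold(LOG_VOLUME, ALERT_THRESHOLD)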

We then attempted to increase the space allocated to it, but found that the disk layout didn't allow for this easily. Instead, we opted to curtail the logs with a database command, which we started immediately.
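
We haven't named the database engine above. Purely as an illustration, assuming a MySQL-style primary whose binary logs had filled the volume, the curtailment might look like the sketch below (the host, credentials, and cutoff date are placeholders, not our real details):

    import mysql.connector  # assumes a MySQL-style setup; details are placeholders

    conn = mysql.connector.connect(host="primary.internal", user="admin",
                                   password="***")
    cur = conn.cursor()
    # Remove binary logs older than the cutoff to free space on the logs
    # volume. Caution: purging logs that secondaries have not yet read
    # will stop replication -- a failure mode consistent with what we
    # describe below.
    cur.execute("PURGE BINARY LOGS BEFORE '2016-04-04 00:00:00'")
    conn.close()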

It's usually preferable to restore a primary server where possible, rather than failing over to a secondary, and our initial plan was to restore the primary. At this stage, just after we had started the operation to curtail the logs, we decided to fail over instead.

However, because of the curtailment operation, the secondary servers had stopped replicating, and we no longer had a synced (or near-synced) secondary to fail over to. We didn't understand the reason for this at the time, only that a 'normal' failover wasn't possible.
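
A routine replication check on each secondary would have surfaced this much sooner. Continuing the illustrative MySQL-style assumption (hosts and credentials are placeholders, not our real setup), something like this would have flagged the problem:

    import mysql.connector  # illustrative only; connection details are placeholders

    conn = mysql.connector.connect(host="secondary1.internal", user="monitor",
                                   password="***")
    cur = conn.cursor(dictionary=True)
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    conn.close()

    if status is None:
        print("ALERT: replication is not configured on this host")
    elif (status["Slave_IO_Running"] != "Yes"
          or status["Slave_SQL_Running"] != "Yes"):
        print("ALERT: replication stopped: %s" % status["Last_Error"])
    elif (status["Seconds_Behind_Master"] is None
          or status["Seconds_Behind_Master"] > 60):
        print("ALERT: secondary lagging: %s" % status["Seconds_Behind_Master"])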

At approximately 13:20 BST the curtailment operation completed, and we were able to restore service through the same primary server.

At around 13:50 BST we understood the reason for the secondary server failures, and immediately withdrew the secondaries from service again. From this point, we only had the primary available.

We placed the system into a degraded state with some features disabled, and by approximately 14:51 BST we had restored service.

During the evening, we re-synced the secondary servers from the primary, and the cluster was back at full capacity.

Where did it go wrong?
  • We have monitoring for disk usage, but not for this particular situation (which was simplified above for clarity).
  • The curtailment operation itself was problematic; in hindsight, a failover at that point would have been wiser.
  • We were too slow to realise that the secondary servers were no longer receiving events from the primary. This slowed recovery down.
  • Our system could not be forced into an 'offline' state, meaning we had to show a 'maintenance' message instead. This behaviour wasn't intentional.
What changes will we make?
  • We will review all our monitoring, including for disk states.
  • We will fix the problem with the 'offline' state (a patch is ready to go for this).
  • We will document a set of checks to run through after any incident, so that we can be absolutely certain of the health of each component, rather than looking at our monitoring as a whole.
  • We will give failover higher priority as a recovery option, and will consider implementing a recovery target (a time window after which we must pursue a failover); see the sketch after this list.
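
As a sketch of the recovery-target idea in the last point (the window and names are illustrative, not a committed design), the decision could be as simple as:

    import time

    # Illustrative recovery target: if restoring the primary takes longer
    # than this window, we commit to a failover rather than continuing
    # to repair in place.
    RECOVERY_TARGET_SECONDS = 15 * 60

    def should_fail_over(incident_start, now=None):
        """Return True once the recovery window on the primary is spent."""
        elapsed = (now if now is not None else time.time()) - incident_start
        return elapsed >= RECOVERY_TARGET_SECONDS

    # Example: 20 minutes into an incident, the window has elapsed.
    print(should_fail_over(time.time() - 20 * 60))  # True -> fail over
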
We're very sorry

We know you rely on us for 100% uptime, and we're sorry we let you down here. Rest assured, we will learn and adapt as a result of this incident.

Posted Apr 08, 2016 - 17:23 BST

Resolved
The incident is now resolved. In case you didn't see our prior update, reports are available again. We're leaving the task counts in the menu bar disabled until late this evening.

We will be writing a postmortem to explain what happened here; please check back by the end of the day to read it. It will be published here. We know that uptime is crucial to you, so it's our first priority too. Sorry to have put you through these problems yesterday.
Posted Apr 08, 2016 - 07:31 BST
Monitoring
Reports are available again; thanks for your patience whilst waiting for them to return. We're continuing to monitor the service for the moment. Everything is running normally, except for task counts, which are still unavailable.
Posted Apr 07, 2016 - 21:07 BST
Update
The service is now stable, but some features (mainly task counts in the menu bar, and reports) are still disabled. We still have a lot of work to do to restore the service completely.

In the meantime, we expect the system to remain available as it currently is. We'll bring reports back online as soon as we can (most likely late this evening), and task counts will return once the service is running normally again.

We will also be posting a full postmortem once the issue is fully resolved. We'll now move to less frequent updates, posting as and when we have more news on this issue. We're really sorry about the problems this afternoon; expect a full explanation in our postmortem as soon as we can get it to you.
Posted Apr 07, 2016 - 15:04 BST
Update
You may have noticed certain features within FLG being disabled over the last 15 minutes. Currently, the service is available, but the task count in the menu bar is disabled, and reports are disabled too. You should now be seeing consistent data; some users may have seen cached data over the last 2 hours. We'll provide another update soon.
Posted Apr 07, 2016 - 14:51 BST
Update
We have a secondary issue related to the earlier outage. The current situation is that the service is available, but very slow. We'll keep you updated here every 30 minutes or sooner.
Posted Apr 07, 2016 - 14:33 BST
Identified
We still appear to be dealing with problems reading from the secondary databases, which means you may be seeing outdated information. We'll update again here within 30 minutes or sooner with more details.
Posted Apr 07, 2016 - 14:03 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 07, 2016 - 13:38 BST
Update
We're back online and all services are available again. We're still working on this, though, so we'll update again within 30 minutes or sooner.
Posted Apr 07, 2016 - 13:31 BST
Update
We should have a resolution very soon. Sorry for the problems here; we'll update again shortly, within 30 minutes or sooner.
Posted Apr 07, 2016 - 13:08 BST
Identified
We're working to fix the problem. We'll update here in 30 minutes or sooner.
Posted Apr 07, 2016 - 12:51 BST
Investigating
We are currently investigating this issue.
Posted Apr 07, 2016 - 12:45 BST