At approximately 12:40 BST on 7 April 2016, our automated monitoring alerted us to an outage. We immediately started investigating and identified that our primary database server had a growing backlog of connections.
Shortly after, at about 12:45 BST, we identified that the issue was caused by the logs' logical volume (the disk space allocated to the database logs) being full.
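For context, this is the kind of condition a simple free-space check catches. The sketch below is illustrative only: the mount point and threshold are assumptions, not our actual configuration or monitoring setup.

```python
import shutil

# Hypothetical mount point for the logs logical volume; the real path on
# our servers differs.
LOG_VOLUME = "/var/lib/database/logs"

# Illustrative threshold: warn when less than 10% of the volume is free.
FREE_FRACTION_THRESHOLD = 0.10

def log_volume_nearly_full(path: str) -> bool:
    """Return True if the volume holding the logs is running out of space."""
    usage = shutil.disk_usage(path)           # total, used, free (bytes)
    free_fraction = usage.free / usage.total
    if free_fraction < FREE_FRACTION_THRESHOLD:
        print(f"WARNING: {path} is {100 * (1 - free_fraction):.1f}% full")
        return True
    return False

if __name__ == "__main__":
    # Check the root volume as a stand-in when run directly.
    log_volume_nearly_full("/")
```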
We then attempted to increase the space allocated to that volume, but found that the layout of the disk didn't allow for this easily. Instead, we opted to curtail the logs with a database command, which we started immediately.
It's usually preferable to restore a primary server where possible, rather than failing over to a secondary, and our initial plan was to restore the primary. At this stage, just after we had started the operation to curtail the logs, we decided to fail over instead.
However, because of the curtailment operation, the secondary servers had stopped replicating, and we no longer had a synced (or nearly synced) secondary to fail over to. We didn't understand the reason for this at the time, only that a 'normal' failover wasn't possible.
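For illustration, a guard of roughly this shape prevents a log-curtailment operation from discarding entries a secondary still needs, which is one plausible way replication can break in a situation like this. This is a minimal Python sketch under that assumption; the function and the replication offsets are hypothetical, not our actual database or tooling.

```python
from typing import Dict

def safe_purge_position(secondary_positions: Dict[str, int]) -> int:
    """Newest log position that can be discarded without breaking
    replication: every secondary has already replayed past it."""
    if not secondary_positions:
        # No secondaries reporting: purging anything risks discarding
        # changes that exist nowhere else, so purge nothing.
        return 0
    return min(secondary_positions.values())

# Illustrative replication offsets (assumed values, not real ones).
positions = {"secondary-1": 10_452, "secondary-2": 10_448}
print(safe_purge_position(positions))  # -> 10448
```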
At approximately 13:20 BST, the curtailment operation completed and we were able to restore service on the same primary server.
At around 13:50 BST, we understood the reason for the secondary server failures and immediately withdrew service again. From this point, only the primary was available.
We placed the system into a degraded state by disabling some features, and at approximately 14:51 BST we restored service.
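As a rough illustration of what running in a degraded state means in practice, here is a minimal Python sketch of a switch that turns off non-essential features while the core service keeps running. The feature names and the mechanism are assumptions for the sake of the example; this post doesn't describe our actual implementation.

```python
# Illustrative only: a minimal in-process switch for a degraded mode.
DEGRADED_MODE = True

# Hypothetical feature names; ours differ.
NON_ESSENTIAL_FEATURES = {"scheduled_reports", "csv_exports"}

def feature_enabled(name: str) -> bool:
    # While degraded, non-essential features are switched off so the
    # remaining primary only handles core traffic.
    if DEGRADED_MODE and name in NON_ESSENTIAL_FEATURES:
        return False
    return True

print(feature_enabled("csv_exports"))   # False while degraded
print(feature_enabled("core_feature"))  # core behaviour stays on
```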
During the evening, we re-synced the secondary servers from the primary, and the cluster was back at full capacity.
We know you rely on us for 100% uptime, and we're sorry we let you down here. Rest assured, we will learn from this incident and adapt.