Service Unavailable
Incident Report for FLG
Postmortem

** What happened? **

From approximately 14:30 to 14:44 GMT on 17th March, and from 18:05 to 18:30 GMT on the same day, the FLG service was unavailable.

** How did this affect the service? **

Users would have seen a 'service is unavailable' notice when they tried to access the service, and API requests and web forms would have failed.

Background actions such as workflows and lead actions would have queued up and been processed after each incident. Because of the backlog, it took up to 15 minutes after each incident for background processing to fully catch up.

** What caused the incidents? **

Our main database went completely offline during the first incident. As all of the other services remained online, we failed over to standby hardware.

We were still investigating the first incident when the second one occurred, taking two database servers offline, including the new main database. This time, we decided not to fail over again, but instead to recover the offline services.

The recovery was completed at 18:30 GMT. By 22:07 GMT the same day, all services were back online, including the server involved in the first incident.

** What was found to be the root cause? **

On the day before and on the morning of the incidents, engineers working on our behalf had installed a health check script to monitor the RAID cards in a number of our servers.

Although the script itself was innocuous, certain condition(s) caused a server to hang when it ran.

** How can this be prevented in future? **

The script has now been removed from all of our servers. If we need to reintroduce it, it will be tested thoroughly and phased in gradually.

We realise that FLG is a core application for our customers, and any kind of downtime, even momentary, is unacceptable. We're very sorry for the problems these incidents would have caused your business on Friday.

Thanks for using FLG.

Posted Mar 20, 2017 - 11:03 GMT

Resolved
This incident is resolved. We'll be investigating this alongside the earlier incident today as the highest priority.
Posted Mar 17, 2017 - 19:28 GMT
Monitoring
Service is restored but we're urgently looking into how the incident from earlier has essentially reoccurred. We'll update here again when we have news. For now, the service is running normally.
Posted Mar 17, 2017 - 18:36 GMT
Identified
We're restarting some services to bring service back online. We'll bring you more updates as we have them, or within 30 minutes.
Posted Mar 17, 2017 - 18:18 GMT
Update
We're looking into this now. We'll update again here within 30 minutes or sooner.
Posted Mar 17, 2017 - 18:09 GMT
Investigating
We are currently investigating this issue.
Posted Mar 17, 2017 - 18:05 GMT