**What happened?**
From approximately 14:30 to 14:44 GMT on 17th March, and from 18:05 to 18:30 GMT on the same day, the FLG service was unavailable.
**How did this affect the service?**
Users would have seen a 'service is unavailable' notice when they tried to access FLG, and API requests and web form submissions would have failed.
Background actions such as workflows and lead actions would have queued up and been processed after each incident. Because of the backlog, it took up to 15 minutes after each incident for background processing to fully catch up.
**What caused the incidents?**
Our main database server went completely offline during the first incident. As all of the remaining services stayed online, we failed the database over to standby hardware.
We were still investigating the first incident when the second one occurred, this time taking two database servers offline, including the new main database. This time, we decided not to fail over again, but instead to recover the offline servers.
The recovery was completed and service restored at 18:30 GMT. By 22:07 GMT the same day, all servers were back online, including the one involved in the first incident.
**What was found to be the root cause?**
The day before and on the morning of these incidents, engineers working on our behalf had installed a healthcheck script to monitor the health of the RAID cards in a number of our servers.
Although the script itself appeared innocuous, under certain conditions it caused the server to hang when it ran.
**How can this be prevented in future?**
The script has now been removed from all of our servers. If we need to reintroduce it, it will be tested thoroughly and phased in gradually.
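As an aside, one defensive pattern for this class of problem is to run any hardware healthcheck under a hard timeout, so that a hung check fails fast rather than blocking the host. The sketch below is illustrative only and is not the script involved in these incidents; the `RAID_CHECK_CMD` stand-in is a hypothetical placeholder for whatever vendor RAID tool a given server uses.

```python
import subprocess

# Hypothetical placeholder for a vendor RAID status command.
# (A harmless stand-in is used here so the sketch is runnable.)
RAID_CHECK_CMD = ["echo", "RAID OK"]

def check_raid(timeout_s: float = 10.0) -> str:
    """Run the healthcheck with a hard timeout so a hung check
    returns a failure status instead of blocking indefinitely."""
    try:
        result = subprocess.run(
            RAID_CHECK_CMD,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return "OK" if result.returncode == 0 else "DEGRADED"
    except subprocess.TimeoutExpired:
        return "CHECK_TIMED_OUT"

print(check_raid())
```

A monitoring agent calling `check_raid()` would then alert on `CHECK_TIMED_OUT` the same way it alerts on `DEGRADED`, instead of silently hanging.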
We realise that FLG is a core application for our customers, and that any downtime, however brief, is unacceptable. We're very sorry for the disruption these incidents caused your business on Friday.
Thanks for using FLG.