Slow Service
Incident Report for FLG
Postmortem

This incident report relates to degraded service (particularly background tasks such as bulk actions and processing of inbound leads) on the 10th and 15th April 2014.

Recent changes in workloads had left our background processing systems unable to cope with the load. This had a knock-on effect on our main application and API.

During the incident on the 10th April, we managed workloads on our processing queues and collected data in order to investigate the cause. We took bulk actions, import and update tasks out of the queues to allow higher-priority work to complete. We also tightened the limit on concurrent background tasks, reducing the allowance to 2 tasks per account (previously 6 per account). This is a temporary measure and will be lifted in the next few weeks; we'll notify you when that happens.
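For illustration only, here is a minimal sketch of the kind of per-account cap described above. The names (try_start_task, MAX_TASKS_PER_ACCOUNT) and the in-memory counter are hypothetical and are not FLG's actual implementation.

# Hypothetical sketch of a per-account cap on concurrent background tasks.
# The names and in-memory storage here are illustrative, not FLG's code.
from collections import defaultdict
from threading import Lock

MAX_TASKS_PER_ACCOUNT = 2  # temporarily reduced from 6 per account

_running = defaultdict(int)  # account id -> tasks currently running
_lock = Lock()

def try_start_task(account_id: str) -> bool:
    """Claim a slot for the account; return False if it is already at the cap."""
    with _lock:
        if _running[account_id] >= MAX_TASKS_PER_ACCOUNT:
            return False  # leave the job queued and retry it later
        _running[account_id] += 1
        return True

def finish_task(account_id: str) -> None:
    """Release the slot when a background task completes (or fails)."""
    with _lock:
        _running[account_id] = max(0, _running[account_id] - 1)

A real system would keep this counter somewhere shared (such as the job queue itself) rather than in process memory, but the effect is the same: bulk actions, imports and updates for one account can no longer crowd out everyone else's work.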

Service was again affected on the 15th April, and indications were that this was a recurrence of the problems of the 10th. We again moved some workloads around, and this time applied some code changes that gave us immediate improvements.

On the evening of the 15th, having now fully diagnosed the bottlenecks, we made further configuration and hardware changes that we're confident will stop this situation occurring again. We're also planning some further work in the next week to increase our capacity; this does not require a maintenance window.

We know that problems like these are frustrating and hit your productivity. We need to monitor our workloads more closely, especially now we've seen that small changes in workload can have a disproportionate effect on service. So we'll be working on that.

Sorry for the disruption, and thanks for using our platform.

Posted Apr 17, 2014 - 12:39 BST

Resolved
Service is almost back to 100% now. We'll post an incident report here once we can explain more about the incident and our next steps. Thank you.
Posted Apr 15, 2014 - 16:14 BST
Update
Today's issue and the incident on the 10th April (background queue delays) are essentially the same thing: the root cause is a backlog that builds up because of an excess of webhook triggers to process.

We need to let background processing clear itself down: any short-term gain from taking jobs out of the queue will eventually lead to a greater impact for some customers, as relatively short delays become much longer.
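As a back-of-the-envelope illustration (the figures below are invented, not measurements from our systems): when jobs arrive faster than the workers can process them, the backlog, and with it the delay, grows steadily, and any jobs pulled out of the queue still have to be processed later.

# Back-of-the-envelope backlog sketch; both rates are made up for illustration.
ARRIVALS_PER_MIN = 120   # hypothetical webhook triggers queued per minute
PROCESSED_PER_MIN = 100  # hypothetical worker throughput per minute

backlog = 0
for minute in range(60):  # one hour at these rates
    backlog = max(0, backlog + ARRIVALS_PER_MIN - PROCESSED_PER_MIN)

# 1200 jobs are now waiting, roughly 12 minutes of delay at current throughput.
# Removing jobs from the queue shortens the line today, but the work still has
# to be done later, so the delay comes back and some customers wait far longer.
print(backlog, backlog / PROCESSED_PER_MIN)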

The effects are, again: 1) delayed inbound leads, currently at ~10 minutes, 2) delayed outbound webhooks, and 3) sluggish general response times in the web UI & API.

We've already applied a couple of patches to move workload around, with the aim of improving the overall sluggishness; that's our main priority. There will be an incident report for this issue and the linked problem from the 10th once we've properly investigated and made the changes we need to.

We will update again here when the situation improves. Thanks for putting up with a frustrating service level today.
Posted Apr 15, 2014 - 12:47 BST
Monitoring
We're monitoring the situation - we've moved some workloads around to improve response times.
Posted Apr 15, 2014 - 12:00 BST
Investigating
We're looking into why the service is running so slowly right now. Updates here soon.
Posted Apr 15, 2014 - 10:20 BST