This incident report covers the degraded service, particularly for background tasks such as bulk actions and processing of inbound leads, on 10th and 15th April 2014.
Recent changes in workloads had left our background processing systems unable to cope with the load. This had a knock-on effect on our main application and API.
During the incident on the 10th April, we managed workloads on our processing queues and collected data in order to investigate the cause. We took bulk actions, import, and update tasks out of queues to allow higher-priority work to complete. We also tightened the limit on concurrent background tasks, reducing the allowance from 6 tasks per account to 2. This is a temporary measure and will be lifted in the next few weeks. We'll notify you when that happens.
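For the curious, here's a rough sketch of how a per-account cap like this can work. It's illustrative only, not our actual worker code; the `PerAccountLimiter` class, the account IDs, and the `MAX_CONCURRENT_TASKS_PER_ACCOUNT` constant are stand-ins for whatever your queueing system provides.

```python
import threading
from collections import defaultdict

# Illustrative cap, matching the temporary limit described above (not our real config).
MAX_CONCURRENT_TASKS_PER_ACCOUNT = 2

class PerAccountLimiter:
    """Tracks running background tasks per account and refuses work
    for accounts already at the cap (hypothetical sketch)."""

    def __init__(self, limit=MAX_CONCURRENT_TASKS_PER_ACCOUNT):
        self._limit = limit
        self._running = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, account_id):
        """Reserve a slot if the account is under its cap; return False otherwise."""
        with self._lock:
            if self._running[account_id] >= self._limit:
                return False
            self._running[account_id] += 1
            return True

    def release(self, account_id):
        """Free a slot once the task finishes."""
        with self._lock:
            self._running[account_id] -= 1

def run_task(limiter, account_id, task):
    """Run a task only if the account has capacity; otherwise requeue it."""
    if not limiter.try_acquire(account_id):
        return "requeued"  # push back onto the queue for a later attempt
    try:
        task()
        return "done"
    finally:
        limiter.release(account_id)

# Usage: the third concurrent task for the same account would be requeued.
limiter = PerAccountLimiter()
print(run_task(limiter, "acct-1", lambda: None))  # "done"
```

The point of a cap like this is fairness: one account running a huge bulk import can no longer starve everyone else's background work.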
Service was again affected on the 15th April, and indications were that this was a recurrence of the problems from the 10th. We again moved some workloads around, and this time applied some code changes that gave immediate results.
On the evening of the 15th, having fully diagnosed the bottlenecks, we made further configuration and hardware changes that we're confident will prevent this situation from recurring. We're also planning additional work in the next week that will further improve our capacity; this does not require a maintenance window.
We know that problems like these are frustrating and hurt your productivity. We need to monitor our workloads more closely, especially now that we've seen how small changes in workload can have a compounding, rapidly degrading effect on service. So we'll be working on that.
Sorry for the disruption, and thanks for using our platform.