This is a postmortem report on the keyword search outage last Monday, which affected all users, and on the inconsistent search results that followed until the weekend.
We have been planning and implementing a new keyword search infrastructure since August. It is intended to replace our existing system, which was becoming increasingly unreliable under growing load.
The new infrastructure was live in production last Monday, but with only minimal traffic directed to it. During the afternoon, several issues occurred with the old system, and we decided to bring the new infrastructure fully online.
After about 90 minutes of operation, at approximately 17:00, load spiked and the cluster began to fail queries. We immediately scaled up the cluster's resources, and keyword search returned at 17:54.
During the evening, an operation was started on the search infrastructure to reindex all records in order to do a rolling upgrade to larger hardware.
The reindexing operation was expected to complete during the evening, but it actually continued for days.
The search cluster remained available during the reindexing, but while it was running 1) new records didn't always show up immediately, and 2) some records were not discoverable.
A further issue was identified that prevented some records containing certain UTF-8 control characters from reaching the index.
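For illustration only, one common way to guard against this class of problem is to strip control characters from text fields before they are sent to the index. The sketch below is not our actual fix. The function name, the choice to keep tab, newline and carriage return, and the use of the Unicode "Cc" category are all assumptions made for the example.

```python
import unicodedata

# Assumption for this sketch: tab, newline and carriage return are harmless
# and worth keeping; every other control character is dropped.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def strip_control_chars(text):
    """Return text with Unicode control characters (category Cc) removed,
    except for the whitespace controls listed in ALLOWED_CONTROLS."""
    return "".join(
        ch
        for ch in text
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )


def has_control_chars(text):
    """Report whether stripping would change the text, which is useful for
    logging affected records instead of dropping them silently."""
    return strip_control_chars(text) != text
```

A pipeline using a helper like this would call has_control_chars on each record before indexing, log any record that is affected, and index the cleaned text so the record stays discoverable.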
Later in the week, with the reindexing still not complete, we decided to create a brand new cluster, load and index the data, and point queries to it. That was completed by the weekend and finally resolved the issue.
This incident has exposed weaknesses in the new keyword search infrastructure that we did not identify during planning and testing. We'll be assessing what improvements can be made.
In hindsight, we should have concentrated on restoring the existing infrastructure last Monday rather than accelerating our plan to introduce the new cluster through a phased rollout.
Finally, we have implemented a flag in the platform which allows us to disable keyword search in the UI if we ever have an outage again. This will allow us to reduce load on the search cluster to zero whilst we work on the issue, and also allow us to inform all users about the alternative method of locating records (using 'Add Filter').
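As a rough sketch of how a flag like this typically behaves (not the platform's actual code), the request handler checks the flag before touching the search cluster and returns a notice instead of results when search is disabled. All names here, including handle_keyword_search, SearchResponse and the wording of the notice, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical wording; the real message shown to users may differ.
FALLBACK_NOTICE = (
    "Keyword search is temporarily unavailable. "
    "Use 'Add Filter' to locate records."
)


@dataclass
class SearchResponse:
    results: List[str] = field(default_factory=list)
    notice: str = ""


def handle_keyword_search(
    query: str,
    keyword_search_enabled: bool,
    backend: Callable[[str], List[str]],
) -> SearchResponse:
    """Serve a keyword search, or a notice if the flag has disabled it."""
    if not keyword_search_enabled:
        # The backend is never called, so the cluster receives no query load.
        return SearchResponse(notice=FALLBACK_NOTICE)
    return SearchResponse(results=backend(query))
```

The point of checking the flag before any backend call is that, once it is switched off, a struggling cluster sees no user queries at all while users are still told how to find their records.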
We know that your users rely on keyword search, and we're very sorry about this incident. We are confident that our new infrastructure will be much faster and more resilient than before.