Responding to CloudWatch alerts

We have three types of CloudWatch alerts that get sent to the #dm-2ndline Slack channel. These should be investigated by the 2nd line developers as soon as possible after they appear.

See Alerting for more information on how the alerts are set up.

Investigating alerts

Production-500s

  • In Kibana, use the ‘5xx requests’ shortcut to find the request that caused the error. This should show what the user was doing at the time, and how many requests failed.

  • Try to determine the user impact from the logs. Look at whether the request was made by a human user, or a script/smoke test. Also, if possible, check if the request was subsequently retried successfully.

  • If you cannot determine the cause from Kibana, check app metrics on the Grafana dashboards. Look for recent crashes or a high number of concurrent requests.
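The triage steps above could be sketched in code. This is a hypothetical illustration only: the field names (`status`, `path`, `userAgent`) and the script markers are assumptions, not the real log schema.

```python
# Hypothetical sketch: triage 5xx log entries exported from Kibana.
# Field names (status, path, userAgent) are assumed, not the real schema.

# Assumed substrings that mark non-human traffic (scripts, smoke tests).
SCRIPT_MARKERS = ("python-requests", "curl", "smoke-test")


def triage_5xx(entries):
    """Count 5xx responses grouped by (path, human-or-script)."""
    summary = {}
    for entry in entries:
        if not 500 <= entry["status"] <= 599:
            continue  # only interested in server errors
        agent = entry.get("userAgent", "").lower()
        kind = "script" if any(m in agent for m in SCRIPT_MARKERS) else "human"
        key = (entry["path"], kind)
        summary[key] = summary.get(key, 0) + 1
    return summary
```

Grouping by path and requester type gives a quick read on user impact: a burst of script-only failures is usually less urgent than even one failing human journey.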

Production-router-slow-requests-gt10s

  • Check Kibana as above (using the ‘Slow requests’ shortcut) and see if the endpoints are unusually slow.

  • Check the app metrics on Grafana as above.
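A sketch of the check above, to show the shape of the analysis rather than the real tooling: the `requestTime` field name and its units (seconds) are assumptions.

```python
# Hypothetical sketch: summarise slow requests by endpoint.
# The duration field name (requestTime) and units (seconds) are assumed.

THRESHOLD_SECONDS = 10  # matches the alert's >10s condition


def slow_endpoints(entries):
    """Return {path: (count, worst_duration)} for requests over the threshold."""
    stats = {}
    for entry in entries:
        duration = entry.get("requestTime", 0)
        if duration <= THRESHOLD_SECONDS:
            continue
        count, worst = stats.get(entry["path"], (0, 0))
        stats[entry["path"]] = (count + 1, max(worst, duration))
    return stats
```

Looking at both the count and the worst duration per endpoint helps tell a single pathological request apart from an endpoint that is slow across the board.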

Production-429s

  • Check Kibana as above, but include the userAgent field in the query. This field will suggest whether the user is a human or a bot.

  • Some bots have a userAgent that is indistinguishable from a human browser's. You can identify them by looking at the pattern of traffic from that userAgent: they may be sending requests at a rate or regularity that no human could manage.

  • If a human user is seeing 429 errors through normal behaviour, you may want to adjust the router app rate limiting settings.

  • To avoid double counting requests in the nginx logs, add httpHost="www.digitalmarketplace.service.gov.uk" to your query. This will filter out the logs produced by nginx serving up the static 429 error page.
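The last two steps could be combined into a rough script. This is an illustrative sketch only: the field names (`httpHost`, `userAgent`, `timestamp`) and the requests-per-minute threshold are assumptions, not a real detection rule.

```python
# Hypothetical sketch: flag bot-like userAgents among 429 log entries.
# Field names (httpHost, userAgent, timestamp-in-seconds) and the
# MAX_HUMAN_RPM threshold are assumptions for illustration.

HOST = "www.digitalmarketplace.service.gov.uk"
MAX_HUMAN_RPM = 60  # assumed: sustained >1 request/second looks bot-like


def flag_bots(entries):
    """Return {userAgent: is_bot_like} based on sustained request rate."""
    timestamps = {}
    for entry in entries:
        # Filter on httpHost to drop the nginx lines for the static
        # 429 error page, which would otherwise double count requests.
        if entry.get("httpHost") != HOST:
            continue
        timestamps.setdefault(entry["userAgent"], []).append(entry["timestamp"])
    flags = {}
    for agent, times in timestamps.items():
        window = max(times) - min(times) or 1  # seconds; avoid division by zero
        rpm = len(times) * 60 / window
        flags[agent] = rpm > MAX_HUMAN_RPM
    return flags
```

A sustained per-userAgent rate is only one heuristic; eyeballing the traffic pattern in Kibana (regular intervals, identical paths) remains the more reliable check.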

Next steps

For all the error types above, once you’ve identified the cause, reply to the Slack alert (in a thread) to inform the team. Ask for help if you’re stuck!

If the error was due to a bug, create a card either on the 2nd line Trello Board (for urgent problems) or on the Tech Debt Trello board (for non-urgent problems).

If the error has a low impact or is intermittent, consider adding it to the 2nd line Trello Board ‘watchlist’ and monitoring it for a week or two. If the problem gets worse, there will be a record of what has happened so far; if it doesn’t recur, the card can be moved to ‘Done’.

For ongoing issues with a high user impact, follow the incident process.