Incidents
Grading an incident
Each incident must be assigned a priority level from 1 to 4 (P1 to P4) based on:
the number of users affected
the impact on those users
whether one or more critical journeys have been disrupted (critical journeys change throughout the year depending on the frameworks’ lifecycles)
Based on the priority, there are different response times.
Roughly speaking, priorities are as follows:
P1 is critical: there is a complete outage, critical journeys are disrupted or there is a major security breach; for example:
users can’t access the G-Cloud catalogue or the list of opportunities
users can’t submit their services or supplier declaration during the submissions process
users can’t log in or access the Digital Marketplace homepage
P2 is major: a substantial degradation of service; for example:
users see a ‘technical issues’ page when publishing a brief or completing their application
P3 is significant: users experience intermittent or degraded service due to a platform issue; for example:
users can’t return a signed framework agreement
users can’t download a framework agreement
P4 is minor: component failure that does not immediately affect the service
The incident lead may change the priority level during the incident if its impact changes or becomes clearer.
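As a rough illustration only, the grading guidance above could be sketched as a small Python function. The field names (`complete_outage`, `critical_journey_disrupted` and so on) are hypothetical and not part of any Digital Marketplace tooling. Real grading remains a judgement call by the incident lead, who also weighs the number of users affected.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Hypothetical summary of what is currently known about an incident."""
    complete_outage: bool = False
    critical_journey_disrupted: bool = False
    major_security_breach: bool = False
    substantial_degradation: bool = False
    intermittent_or_degraded: bool = False


def grade(facts: IncidentFacts) -> str:
    """Suggest a priority, checking the most severe criteria first."""
    if (facts.complete_outage
            or facts.critical_journey_disrupted
            or facts.major_security_breach):
        return "P1"  # critical
    if facts.substantial_degradation:
        return "P2"  # major
    if facts.intermittent_or_degraded:
        return "P3"  # significant
    return "P4"  # minor: component failure with no immediate effect on the service
```

Because the priority can change as the impact becomes clearer, a function like this would be re-run with updated facts rather than treated as a one-off decision.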
P1 and P2 process
Take a breath. Everything will be fine.
If you’ve just found out about an incident, you may want to be the incident and comms lead. You can pass the responsibility on to a Delivery Manager, Product Manager, Technical Architect or other member of the team at any point.
If you are the incident and comms lead you will:
Ensure people are aware that there is an incident:
post on the #dm-incidents Slack channel with an @here with a brief description of what is happening
share the post on #general Slack channel with an @here
Establish a technical lead. This is usually the most experienced person on 2nd line or in the wider team
If you are the technical lead you will:
lead and coordinate the technical investigation
support the incident lead and keep them updated on the technical progress
add to the incident report once it’s created
Assess the impact of the incident and grade it, based on Digital Marketplace Incident Responses
communicate it on the #dm-incidents channel
Start an incident report by making a copy of the Incident Report Template
share the link on the #dm-incidents channel so that the incident team can contribute
work with the team to fill the overview and impact sections
record important events and changes in the timeline
Delegate comms to another person, who will become the comms lead
If you are the comms lead you will:
send an email (within 30 mins for P1s or 1 hour for P2s) using the incident started template to:
if a data breach is suspected include:
if the incident involves a security breach include:
discuss with CCS whether users should be notified (and how)
keep updating the stakeholders every hour for P1s or every 2 hours for P2s, and whenever there is an important update, using the incident update template
communicate with the stakeholders once the incident is resolved using the incident resolved template
Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
Once the incident is resolved
ensure stakeholders have been notified
update Trello card with relevant details
together with the rest of the incident team, finalise the incident report including all relevant information (such as Slack channel conversations)
add an entry to the Incidents Summary spreadsheet
arrange an incident review meeting with the incident team at a minimum (everyone in the wider team should know that the meeting is taking place and may request to be invited)
communicate ongoing actions to the CCS support team where required
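The comms timings in the P1 and P2 process can be sketched as a pair of helper functions. This is a minimal sketch: the function names and the dictionary layout are assumptions made for illustration, and only the durations come from the process above.

```python
from datetime import datetime, timedelta

# Durations taken from the P1 and P2 process; the structure is hypothetical.
FIRST_EMAIL_WITHIN = {"P1": timedelta(minutes=30), "P2": timedelta(hours=1)}
UPDATE_EVERY = {"P1": timedelta(hours=1), "P2": timedelta(hours=2)}


def first_email_due(priority: str, detected_at: datetime) -> datetime:
    """Latest time by which the 'incident started' email should be sent."""
    if priority not in FIRST_EMAIL_WITHIN:
        raise ValueError("comms timings are only defined for P1 and P2")
    return detected_at + FIRST_EMAIL_WITHIN[priority]


def next_update_due(priority: str, last_sent_at: datetime) -> datetime:
    """Latest time for the next stakeholder update after the previous one."""
    if priority not in UPDATE_EVERY:
        raise ValueError("comms timings are only defined for P1 and P2")
    return last_sent_at + UPDATE_EVERY[priority]


detected = datetime(2024, 1, 1, 9, 0)  # example timestamp, not a real incident
print(first_email_due("P1", detected))   # 2024-01-01 09:30:00
print(next_update_due("P2", detected))   # 2024-01-01 11:00:00
```

Important updates should still go out as soon as they happen; these deadlines only describe the longest acceptable gap between messages.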
P3 and P4 process
Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
add a description of the issue
include as much information as possible: links to status pages, links to email groups and messages, and copy-pasted Slack conversations
highlight whether or not there are any imminent risks that might cause this issue to be upgraded (for example, a failure in the backup storage and an impending backup run or a failure in the email service and an impending email run)
Notify relevant developers with an @here in the #dm-2ndline Slack channel
If the service is significantly degraded notify the #general Slack channel and the CCS support team
if the incident involves a data or security breach, notify the Cyber Security team via infosec@crowncommercial.gov.uk
Keep the Trello card up to date
Once the incident is resolved
let the CCS support team know
create an incident report and hold an informal incident review if there is something to learn from the incident
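The notification rules for P3 and P4 incidents could be expressed as a short routing function. This is only a sketch: the function name and its boolean parameters are hypothetical, and it assumes the breach check applies whether or not the service is significantly degraded.

```python
from typing import List


def p3_p4_notifications(significantly_degraded: bool, breach: bool) -> List[str]:
    """Return who should be told about a P3 or P4 incident."""
    # Relevant developers are always notified on the 2nd line channel.
    targets = ["#dm-2ndline"]
    if significantly_degraded:
        targets += ["#general", "CCS support team"]
    if breach:
        # Data or security breaches also go to the Cyber Security team.
        targets.append("infosec@crowncommercial.gov.uk")
    return targets


print(p3_p4_notifications(significantly_degraded=True, breach=False))
```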