Maintenance Mode

Planned downtime

Sometimes we need planned downtime, for example when doing a database restore.

The goal is to stop users accessing the site, preventing them from seeing incorrect information or making changes that may be lost.

To do this we put the Digital Marketplace into ‘maintenance mode’. In this mode, any visitor to the Digital Marketplace is served a static HTML page informing them of the downtime.

Maintenance mode was first used in 2017, when we moved our underlying infrastructure from Elastic Beanstalk to the PaaS.

Accessing applications during maintenance

When maintenance mode is active, no-one is able to access our applications via their normal domains (preview.marketplace.team, staging.marketplace.team, and digitalmarketplace.service.gov.uk). Users see a static page served by the router app.

We also have a ‘recovery’ mode that allows developers access to the API apps, while still blocking user access to the Frontend apps. Recovery mode allows scripts such as the Jenkins re-indexing jobs to access the APIs in the normal way.

Developers can also bypass the router, accessing the apps via the Cloud Foundry/PaaS ‘internal’ domain, cloudapps.digital.

Note that the frontend applications are hosted at dm-<stage>.cloudapps.digital with their specific path suffixes. They are protected by an extra level of Basic Auth, so you will need to supply credentials when accessing them this way (see the credentials repo).
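
As an illustration of the direct route, the sketch below builds a URL on the internal domain and shows, in a comment, how it might be requested with curl. The function name paas_url, the https scheme, the example path and the BASIC_AUTH_USER and BASIC_AUTH_PASS variables are assumptions made for illustration; the real credentials live in the credentials repo.

```shell
# Hypothetical helper: paas_url <stage> [<path>]
# Builds a URL on the internal PaaS domain for the given stage.
# The https scheme is an assumption.
paas_url() {
  stage=$1
  path=${2:-/}
  printf 'https://dm-%s.cloudapps.digital%s\n' "$stage" "$path"
}

# Example request (not run here); curl's -u flag supplies Basic Auth credentials:
#   curl -u "$BASIC_AUTH_USER:$BASIC_AUTH_PASS" "$(paas_url preview /some-path)"
```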

Before putting the application into maintenance mode

The application is monitored. While it is in maintenance mode it returns a 503 to every request, so monitoring will report it as down.
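
The 503 behaviour can be checked from the command line. The following is a minimal sketch, assuming curl is installed; the function names http_status and returns_503 are hypothetical, and the URL in the usage comment is a placeholder.

```shell
# Print only the HTTP status code for a URL, using curl's %{http_code} write-out.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Succeed (exit 0) when the URL answers with 503, as expected in maintenance mode.
returns_503() {
  [ "$(http_status "$1")" = "503" ]
}

# Example (not run here):
#   returns_503 "https://<domain>/" && echo "maintenance page is being served"
```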

Before any planned maintenance, an email should be sent to TechOps with the following details:

  • What the period of maintenance is for

  • When the period of maintenance will be

  • How long the period of maintenance will last

This will allow them to add the details to the release calendar and stop any unnecessary alerts being sent.

Once the maintenance period has finished, TechOps should be emailed again to report that the Digital Marketplace is back up.

How to activate/deactivate

Note

Inform the Customer Support Centre before activating maintenance mode in case we get queries from users. If supplier applications are open, also inform the Sourcing and Category teams.

There are two ways to activate maintenance mode:

  1. Running the Jenkins job Toggle maintenance mode.

  2. Manually deploying a new version of the PaaS router application.

Automatically activating/deactivating maintenance mode

This is the preferred way to enable or disable maintenance mode and should be used in all cases where our infrastructure is running under normal conditions.

  1. Start a new build of the pipeline job Toggle maintenance mode on ci.marketplace.team. Select the appropriate target stage (preview, staging, production) and mode (maintenance, recovery or live).

  2. Once the build has started, a new Pull Request will be generated against the digitalmarketplace-aws repository. Review, approve, and merge this to master.

  3. After it has been merged to master, continue the Pipeline job in Jenkins by clicking ‘Proceed’ on the two input boxes on the current Pipeline stage.

Manually activating/deactivating maintenance mode

You should only need to manually deploy maintenance mode if some part of our infrastructure or deployment pipeline has failed, for example if Jenkins or GitHub is down.

Activate

  1. Update the target stage’s variables file that is used in our PaaS manifests - https://github.com/alphagov/digitalmarketplace-aws/blob/master/vars/<preview|staging|production>.yml. To turn maintenance mode on, set maintenance_mode: maintenance.

  2. You will need access to the PaaS space you wish to deploy to (likely production). You will need to deploy a new dockerised app. To deploy the router, run:

    STAGE=production APPLICATION_NAME=router RELEASE_NAME=<release_tag> make deploy-app
    

    The release tag has the format release-### and can be found by running cf app router and looking at the currently deployed app’s docker image, which looks something like:

    docker image:      digitalmarketplace/router:release-18
    
  3. The release should only take a few minutes. After the release has completed, maintenance mode will be enabled.
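
The release tag lookup in step 2 could be scripted. This is a minimal sketch, assuming the cf app router output contains a ‘docker image:’ line of the form shown above; the function name extract_release_tag is hypothetical.

```shell
# Hypothetical helper: read `cf app <name>` output on stdin and print the
# release tag found at the end of the "docker image:" line.
extract_release_tag() {
  sed -n 's/^[[:space:]]*docker image:.*:\(release-[0-9][0-9]*\)[[:space:]]*$/\1/p'
}

# Example using the sample line from this page:
printf 'docker image:      digitalmarketplace/router:release-18\n' | extract_release_tag
```

With that helper defined, the deploy command could be written as STAGE=production APPLICATION_NAME=router RELEASE_NAME=$(cf app router | extract_release_tag) make deploy-app.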

Deactivate

  1. The process is the same as activation, except that you set maintenance_mode: live.
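
The variables-file change used for both activating and deactivating could be made with a small helper. This is a sketch only: it assumes maintenance_mode is a top-level key on its own line in the vars file, and that the file accepts the same three values as the Jenkins job (maintenance, recovery, live). The function name set_maintenance_mode is hypothetical.

```shell
# Hypothetical helper: set_maintenance_mode <vars_file> <mode>
# Rewrites the value of the top-level maintenance_mode key in place.
set_maintenance_mode() {
  vars_file=$1
  mode=$2
  # Reject anything other than the three modes named on this page.
  case "$mode" in
    maintenance|recovery|live) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  # Write the edited copy to a temp file, then copy it back over the
  # original so the original file's permissions are kept.
  sed "s/^maintenance_mode:.*/maintenance_mode: $mode/" "$vars_file" > "$tmp" &&
    cat "$tmp" > "$vars_file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Example (not run here):
#   set_maintenance_mode vars/production.yml maintenance
```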

What does it look like?

[Image: maintenance-mode.png]

What’s going on under the hood?

Setting DM_MODE in the manifest passes it as an environment variable to the router app.

There are three modes:

  • DM_MODE = 'maintenance': only the router’s healthcheck and metrics endpoints are served. All other routes are directed to a static maintenance page (served by the router) and return a 503 status code.

  • DM_MODE = 'recovery': the router’s healthcheck and metrics endpoints and the API apps are served. All other routes are directed to the static maintenance page.

  • DM_MODE = 'live': all apps are served as normal.

This is achieved by changing which config files the router loads, based on the DM_MODE environment variable. See the nginx startup script in the router app for details.
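
To make the three modes concrete, here is an illustrative sketch of how a startup script might choose what to load from DM_MODE. It is not the router’s actual script: the function name configs_for_mode and the config group names (router-status, api, frontend, maintenance-page) are made up for this example.

```shell
# Hypothetical sketch: print the (made-up) config groups a mode would load.
configs_for_mode() {
  case "$1" in
    # Router healthcheck/metrics only; everything else gets the static page.
    maintenance) echo "router-status maintenance-page" ;;
    # As above, plus the API apps.
    recovery)    echo "router-status api maintenance-page" ;;
    # Everything served as normal.
    live)        echo "router-status api frontend" ;;
    *)           echo "unknown DM_MODE: $1" >&2; return 1 ;;
  esac
}

# Example: show what would be loaded for the current environment,
# defaulting to live when DM_MODE is unset.
configs_for_mode "${DM_MODE:-live}"
```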

Lack of sticky sessions

Before we moved our routing to the PaaS, our traffic came through Amazon Elastic Load Balancers (ELBs). This let us use “sticky sessions” on the ELB to ensure that a user received a consistent experience when the maintenance page was turned on. We no longer have this ability. This means that when you turn on maintenance mode, there is a short period during which both the old and new router instances are healthy and serving requests. If a user initially hits the old router app and is served the HTML for the homepage, they may receive a 503 when trying to load CSS or JavaScript if that request is routed to the new router app.