Database backups

We produce our own backups to populate our development and local environments with up-to-date, sanitised data derived from production data, using the clean-and-apply-db-dump Jenkins jobs.

For restoring from a backup in a disaster scenario we use PaaS backups, which offer a managed service and more fine-grained point-in-time recovery options.

How our database backups are created

Backups are generated by a Jenkins job at 4am every morning. A dump is taken from the production database, which is then zipped, encrypted and uploaded to a replicated S3 bucket, digitalmarketplace-database-backups, for storage. This is done in the following steps (a condensed sketch follows the list):

  • Jenkins creates a unique file name for the dump in the format <stage>-yyyyMMddHHmm.sql.gz.gpg

  • Jenkins deploys a worker app, db-backup, to the PaaS, using the deploy-db-backup-app Makefile command in the digitalmarketplace-aws repo. This app has the scripts required to create and upload the dump baked into its Docker image (see Digital Marketplace Docker Hub).

  • The following environment variables are required by the app manifest:

    - DUMP_FILE_NAME - the unique filename generated by Jenkins (passed as an argument to deploy-db-backup-app)
    - S3_POST_URL_DATA - a signed URL and extra data for POSTing the dump to S3, explained below
    - RECIPIENT - used for encryption with GPG; it signifies which public key to encrypt with

  • An additional variable PUBKEY (the public key used for encryption) is set after the app has spun up.

  • The db-backup app then starts a task container which executes the create-db-dump.sh script. This container has its own disk and memory quotas, which are needed to handle the large file size.
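
Pulled together, the steps above look roughly like the following. This is a condensed sketch, not the real Jenkins job: the Makefile argument, key file name, task quotas and v6-style cf run-task syntax are all assumptions.

    # Condensed sketch of the steps above; names and quotas are illustrative
    DUMP_FILE_NAME="production-$(date +%Y%m%d%H%M).sql.gz.gpg"
    make -C digitalmarketplace-aws deploy-db-backup-app DUMP_FILE_NAME="$DUMP_FILE_NAME"
    cf set-env db-backup PUBKEY "$(cat backup-public-key.asc)"
    cf run-task db-backup "create-db-dump.sh" -m 4G -k 8G   # task gets its own memory/disk quotas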

Note

The clean-and-apply-db-dump Jenkins build can sometimes fail, which can take the lower environments offline. The cause tends to be a corrupted file created by the build. Restarting the build should fix the issue and bring the lower environments back online.

What our backup script does

The create-db-dump.sh script first imports PUBKEY into GPG2. It then connects to the database instance in the PaaS and uses pg_dump to create a plaintext dump with no owner and no access control list. The dump is streamed to gzip and then straight to GPG2 for encryption before being written to disk.
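
As a sketch of that pipeline (variable names are illustrative; the real script is baked into the Docker image), the core is a single stream, so an unencrypted dump never touches the disk:

    # Sketch of create-db-dump.sh's core: dump -> gzip -> GPG in one stream
    echo "$PUBKEY" | gpg2 --import
    pg_dump "$DATABASE_URL" --no-owner --no-acl \
      | gzip \
      | gpg2 --encrypt --recipient "$RECIPIENT" --trust-model always \
          --output "$DUMP_FILE_NAME"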

Next, a Python script, upload-dump-to-s3.py, uploads the dump to S3. It uses S3_POST_URL_DATA (the signed S3 URL data generated earlier) and returns an error if the upload fails.
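
Purely as an illustration of how a presigned POST is consumed (the real uploader is upload-dump-to-s3.py; this assumes S3_POST_URL_DATA is JSON shaped like {"url": ..., "fields": {...}}), the same upload could be done with curl and jq:

    # Illustrative only - assumes S3_POST_URL_DATA is {"url": ..., "fields": {...}}
    URL="$(jq -r '.url' <<<"$S3_POST_URL_DATA")"
    FORM_ARGS=()
    while read -r key value; do
      FORM_ARGS+=(-F "$key=$value")
    done < <(jq -r '.fields | to_entries[] | "\(.key) \(.value)"' <<<"$S3_POST_URL_DATA")
    # in a presigned POST, the signed fields must precede the file field
    curl --fail "${FORM_ARGS[@]}" -F "file=@$DUMP_FILE_NAME" "$URL"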

Next, Jenkins checks that the new encrypted dump in S3 can be decrypted, using a script called check-db-dump-is-decryptable.sh. This ensures that the private key used to decrypt the dumps is the correct counterpart of the public key used to encrypt. Without this check, if the private key were rotated but the public key wasn't for some reason, we wouldn't find out until it was too late.

The decrypt script downloads the new dump from S3. It then decrypts the private GPG key from the credentials repo and imports it. GPG then runs a --list-packets command on the dump. We don't actually care about the packets, but the command needs the correct private key to succeed, which lets us test decryption without actually having to decrypt. Finally, the script deletes the secret key as well as the downloaded dump.
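
A minimal sketch of that check (file names, variables and the key-deletion arguments are illustrative, not the real script):

    # Illustrative sketch of check-db-dump-is-decryptable.sh
    aws s3 cp "s3://digitalmarketplace-database-backups/$DUMP_FILE_NAME" .
    sops --decrypt backup-private-key.asc.enc | gpg2 --import
    # --list-packets only succeeds with the right private key, proving the
    # dump is decryptable without writing a decrypted copy to disk
    gpg2 --batch --pinentry-mode loopback --passphrase "$PASSPHRASE" \
      --list-packets "$DUMP_FILE_NAME" > /dev/null
    gpg2 --batch --yes --delete-secret-and-public-key "$KEY_FINGERPRINT"
    rm "$DUMP_FILE_NAME"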

Finally, Jenkins alerts Slack with a success or failure message and deletes the db-backup app from the PaaS.

The S3 buckets

The bucket where zipped and encrypted dumps are stored in the first instance is called digitalmarketplace-database-backups and is in the Digital Marketplace Backups AWS account.

This bucket has cross-region replication enabled and will replicate all new objects to another bucket called digitalmarketplace-cross-region-database-backups in the eu-west-2 (London) region.

The buckets are accessible to one group and the Jenkins role. The group is called ‘backups’ and contains the users currently in the production_infrastructure group, which means that users on 2nd line support, as well as permanent admins, can GET the backup files. The Jenkins role only has permission to PUT and GET on the bucket, to prevent deletion of dumps.

The buckets sit in the digitalmarketplace-backups account which can only be accessed using a password reset.

The backups in digitalmarketplace-database-backups and digitalmarketplace-cross-region-database-backups are retained for 180 and 7 days respectively.
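
Retention like this is normally enforced with S3 lifecycle expiration rules. A sketch of the kind of rule involved, for the primary bucket (the rule ID and exact configuration are assumptions, not the real setup):

    # Sketch only: expire objects after 180 days in the primary bucket
    aws s3api put-bucket-lifecycle-configuration \
      --bucket digitalmarketplace-database-backups \
      --lifecycle-configuration '{"Rules": [{"ID": "expire-dumps",
        "Status": "Enabled", "Filter": {"Prefix": ""},
        "Expiration": {"Days": 180}}]}'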

Signed S3 URLs

S3_POST_URL_DATA is generated by a script in the AWS repo called generate-s3-post-url-data.py. It must be executed by an AWS entity with the rights to upload to the backups S3 bucket; in our case this is the Jenkins role assumed by the Jenkins server. The signed URL can then be used by an entity with no permissions on the bucket.
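
If generate-s3-post-url-data.py uses boto3's generate_presigned_post (an assumption), the data it produces would look roughly like this:

    {
      "url": "https://digitalmarketplace-database-backups.s3.amazonaws.com/",
      "fields": {
        "key": "production-202101010400.sql.gz.gpg",
        "x-amz-algorithm": "AWS4-HMAC-SHA256",
        "x-amz-credential": "<credential scope of the Jenkins role>",
        "x-amz-date": "<timestamp>",
        "policy": "<base64-encoded policy document>",
        "x-amz-signature": "<signature>"
      }
    }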

GPG

The dumps are encrypted with GPG2. The public and private keys are kept in the digitalmarketplace-credentials repo. The private key is encrypted with SOPS in the usual way; the public key is unencrypted. The private key has a passphrase, which is required to use it. The passphrase is also in the credentials repo and is likewise encrypted with SOPS.

The keys use RSA 4096.
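
For context, a comparable RSA 4096 encryption keypair could be produced and stored like this. This is a sketch under assumptions (file names, user ID and SOPS invocation are illustrative), not the team's actual key ceremony:

    # Sketch only: generate an RSA 4096 encryption key non-interactively,
    # export the public half, SOPS-encrypt the private half
    gpg2 --batch --pinentry-mode loopback --passphrase "$PASSPHRASE" \
      --quick-generate-key "Digital Marketplace DB backups" rsa4096 encr never
    gpg2 --armor --export "Digital Marketplace DB backups" > backup-public-key.asc
    gpg2 --batch --pinentry-mode loopback --passphrase "$PASSPHRASE" \
      --armor --export-secret-keys "Digital Marketplace DB backups" \
      | sops --encrypt /dev/stdin > backup-private-key.asc.enc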

Restoring from a backup

There is no automatic process to restore the production database from one of the dumps. If we're in a situation where that needs to happen, it's probably quite serious and the restore should be done manually. The steps will be similar to the below:

  • Alert the team on the #dm-release Slack channel, and grab the deploy gorilla.

  • Disable writes to the database, either by putting the site into Maintenance mode (preferred option) or stopping the API app in PaaS (only as a last resort). Consider disabling smoke/smoulder tests.

  • Ensure that you’re logged in to Cloud Foundry and are in the production space (if that’s where you’re restoring to):

    cf target -s production
    
  • Follow the restore steps in the PaaS manual to create a new PostgreSQL service from a snapshot or point in time (depending on need). Give the new service a descriptive name like restored-db. A hedged sketch of the cf commands involved appears after this list.

  • The restore will take at least 15 minutes to run. If you need a cup of tea, now is the time.

  • Update the api app to use the new service by changing the production app manifest variables. Note: you will need to copy the entire services block from common.yml to api to stop other service bindings being removed. Once this change is merged, re-release using the Jenkins job rerelease-all-apps.

  • Test that the data has been restored correctly (https://dm-api-production.cloudapps.digital/_status should respond even during maintenance mode).

  • Let stakeholders know that the restore has been completed.

  • Ensure the team has a plan for reconciling any lost data, and how this will be communicated to users.

  • Rename the existing digitalmarketplace_api_db service to something like digitalmarketplace_api_db_old (or just delete it altogether), and rename the restored-db service to digitalmarketplace_api_db, as in the sketch after this list.

  • Revert the change to the production app manifest variables, and rerelease all apps as above.

  • Toggle maintenance mode to ‘recovery’ to restore access to the API apps only.

  • Re-sync the OpenSearch indices for services and briefs, using the Jenkins catchup jobs.

  • Toggle maintenance mode to ‘live’ to restore access to the Frontend apps.

  • Re-enable smoke/smoulder tests (if disabled earlier). And relax.
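
A hedged sketch of the cf commands behind the restore and rename steps above; the plan name and timestamp are placeholders, and the PaaS manual remains the authoritative reference:

    # Placeholders throughout - follow the PaaS manual for the real procedure
    GUID="$(cf service digitalmarketplace_api_db --guid)"
    cf create-service postgres <plan> restored-db \
      -c "{\"restore_from_point_in_time_of\": \"$GUID\",
           \"restore_from_point_in_time_before\": \"2021-01-01T03:00:00Z\"}"
    # ...after re-pointing the api app and verifying the restored data:
    cf rename-service digitalmarketplace_api_db digitalmarketplace_api_db_old
    cf rename-service restored-db digitalmarketplace_api_db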