# Database backups
We produce our own backups for the purpose of populating our development and local environments with up-to-date, sanitised data based on production data, using the `clean-and-apply-db-dump` Jenkins jobs.

For restoring from a backup in a disaster scenario we use PaaS backups, which offer a managed service and more fine-grained point-in-time recovery options.
## How our database backups are created
Backups are generated by a Jenkins job at 4am every morning. A dump is taken from the production database, which is then zipped, encrypted and uploaded to a replicated S3 bucket, `digitalmarketplace-database-backups`, for storage. This is done in the following steps:

1. Jenkins creates a unique file name for the dump in the format `<stage>-yyyyMMddHHmm.sql.gz.gpg`.

2. Jenkins deploys a worker app, `db-backup`, to the PaaS, using the `deploy-db-backup-app` Makefile command in the `digitalmarketplace-aws` repo. This app has the scripts required to create and upload the dump baked into its Docker image (see Digital Marketplace Docker Hub). The following environment variables are required by the app manifest:

   - `DUMP_FILE_NAME` - the unique filename generated by Jenkins (passed as an argument to `deploy-db-backup-app`)
   - `S3_POST_URL_DATA` - a signed URL and extra data for POSTing the dump to S3, explained below
   - `RECIPIENT` - used for encryption with GPG; it signifies which public key to use to encrypt

   An additional variable, `PUBKEY` (the public key used for encryption), is set after the app has spun up.

3. The `db-backup` app then starts a task container which executes `create-db-dump.sh` (see the sketch after this list). This container has its own disk and memory quotas, needed to handle the large file size.
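Pieced together, the nightly steps might look something like the sketch below. The Makefile invocation, key path and task quotas are assumptions for illustration; the real logic lives in the `digitalmarketplace-aws` repo.

```bash
# Sketch of the nightly Jenkins steps (names and quotas are illustrative)
STAGE=production
DUMP_FILE_NAME="${STAGE}-$(date +%Y%m%d%H%M).sql.gz.gpg"

# Deploy the worker app; DUMP_FILE_NAME is passed through to the app manifest
make deploy-db-backup-app DUMP_FILE_NAME="$DUMP_FILE_NAME"

# Set the public key once the app has spun up, then run the dump as a task
# with its own disk and memory quotas
cf set-env db-backup PUBKEY "$(cat public.asc)"
cf run-task db-backup --command "./create-db-dump.sh" -m 2G -k 4G
```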
**Note**

The `clean-and-apply-db-dump` Jenkins build can sometimes fail, which can take the lower environments offline. The cause tends to be a file created by the build becoming corrupted. Restarting the build should fix this issue and bring the lower environments back online.
## What our backup script does
The `create-db-dump.sh` script first imports `PUBKEY` into GPG2. It then connects to the database instance in the PaaS and uses `pg_dump` to create a plaintext dump with no owner and no access control list. The dump is streamed to `gzip` and then straight to GPG2 for encryption before being written to disk.
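The heart of the script is a single streaming pipeline along these lines (a minimal sketch; `DATABASE_URL` is an assumed variable name, and error handling is omitted):

```bash
# Import the public key, then stream the dump through gzip and GPG so the
# plaintext is never written to disk
echo "$PUBKEY" | gpg2 --import

pg_dump "$DATABASE_URL" --format=plain --no-owner --no-acl \
  | gzip \
  | gpg2 --encrypt --recipient "$RECIPIENT" --trust-model always \
      --output "$DUMP_FILE_NAME"
```

Something like `--trust-model always` is typically needed because a freshly imported public key is untrusted by default.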
Next, a Python script, `upload-dump-to-s3.py`, is executed to upload the dump to S3. It uses `S3_POST_URL_DATA` (the signed S3 URL generated earlier) and will return an error if the upload fails.
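A signed S3 POST is a URL plus a set of form fields that must be sent back alongside the file. The real upload is done in Python, but the equivalent with `curl` would look roughly like this (the field names vary with the signature version, so treat them as illustrative):

```bash
# The signed fields must accompany the upload; the file part goes last.
# Other signed fields (credential, algorithm, date) are omitted for brevity.
curl --fail --silent --show-error \
  -F "key=${DUMP_FILE_NAME}" \
  -F "policy=${POLICY}" \
  -F "x-amz-signature=${SIGNATURE}" \
  -F "file=@${DUMP_FILE_NAME}" \
  "${S3_POST_URL}"
```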
Next, Jenkins checks that the new encrypted dump in S3 can be decrypted, using a script called `check-db-dump-is-decryptable.sh`. This is to ensure that the private key used to decrypt the dumps is the correct counterpart of the public key used to encrypt. If the private key were rotated but the public key wasn't for some reason, we wouldn't know about it until too late without this check.
The decrypt script downloads the new dump from S3. It then decrypts the private GPG key from the credentials repo and imports it. GPG then executes a `--list-packets` command on the dump. We don't actually care about the packets, but the command needs the correct private key to operate successfully, which means we can test decryption without actually having to decrypt. Finally, it deletes the secret key as well as the downloaded dump.
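In outline, the check does something like the following (a sketch; the file paths and SOPS invocation are assumptions):

```bash
# Fetch the new dump, then decrypt and import the private key
aws s3 cp "s3://digitalmarketplace-database-backups/${DUMP_FILE_NAME}" .
sops --decrypt private.asc.enc | gpg2 --import

# --list-packets only succeeds with the right secret key, so it proves the
# dump is decryptable without writing any decrypted data to disk
gpg2 --batch --pinentry-mode loopback --passphrase "$PASSPHRASE" \
  --list-packets "$DUMP_FILE_NAME" > /dev/null

# Clean up the secret key and the downloaded dump
gpg2 --batch --yes --delete-secret-keys "$KEY_FINGERPRINT"
rm -f "$DUMP_FILE_NAME"
```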
Finally, Jenkins alerts Slack with either a success or failure message and deletes the `db-backup` app from the PaaS.
## The S3 buckets
The bucket where zipped and encrypted dumps are stored in the first instance is called `digitalmarketplace-database-backups` and is in the Digital Marketplace Backups AWS account. This bucket has cross-region replication enabled and will replicate all new objects to another bucket called `digitalmarketplace-cross-region-database-backups` in the eu-west-2 (London) region.
The buckets are accessible to one group and the Jenkins role. The group is called 'backups' and contains the users currently in the `production_infrastructure` group. This means that users on 2nd line support, as well as permanent admins, are able to GET the backup files. The Jenkins role only has permission to PUT or GET on the bucket, to prevent deletion of dumps. The buckets sit in the `digitalmarketplace-backups` account, which can only be accessed using a password reset.
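A policy limited to PUT and GET would look something like this (a sketch of the idea, not the actual policy; the role and policy names are made up):

```bash
# No s3:DeleteObject, so dumps cannot be removed through this role
aws iam put-role-policy \
  --role-name jenkins \
  --policy-name database-backups-put-get \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::digitalmarketplace-database-backups/*"
    }]
  }'
```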
The backups in `digitalmarketplace-database-backups` and `digitalmarketplace-cross-region-database-backups` are retained for 180 and 7 days respectively.
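Retention like this is typically enforced with an S3 lifecycle expiration rule, along these lines (a sketch; whether the real buckets use lifecycle rules, and the rule ID, are assumptions):

```bash
# Expire objects in the primary backups bucket after 180 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket digitalmarketplace-database-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-database-backups",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 180}
    }]
  }'
```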
## Signed S3 URLs
The `S3_POST_URL_DATA` is generated by a script in the AWS repo called `generate-s3-post-url-data.py`. It needs to be executed by an AWS entity with the correct rights to upload to the backups S3 bucket; in our case this is the Jenkins role assumed by the Jenkins server. The signed URL can then be used by an entity with no permissions on the bucket.
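The script isn't reproduced here, but a presigned POST (for example, as produced by boto3's `generate_presigned_post`, a plausible implementation rather than a confirmed one) bundles the upload URL with the signed form fields, so `S3_POST_URL_DATA` is presumably shaped something like this (values and field names are illustrative):

```bash
S3_POST_URL_DATA='{
  "url": "https://digitalmarketplace-database-backups.s3.amazonaws.com/",
  "fields": {
    "key": "<the dump file name>",
    "policy": "<base64-encoded upload policy>",
    "x-amz-credential": "<access key ID and scope>",
    "x-amz-algorithm": "AWS4-HMAC-SHA256",
    "x-amz-date": "<timestamp>",
    "x-amz-signature": "<signature>"
  }
}'
```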
## GPG
The dumps are encrypted with GPG2. The public and private keys are kept in the `digitalmarketplace-credentials` repo. The private key is encrypted with SOPS in the usual way; the public key is unencrypted. The private key has a passphrase, which is required to use it; this is also in the credentials repo and is also encrypted with SOPS. The keys use RSA 4096.
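For reference, a keypair with these properties could be generated non-interactively like this (a sketch, assuming a signing primary key with an encryption subkey; the name and expiry are made up):

```bash
# Generate an RSA 4096 keypair with an encryption subkey
gpg2 --batch --gen-key <<'EOF'
Key-Type: RSA
Key-Length: 4096
Key-Usage: sign
Subkey-Type: RSA
Subkey-Length: 4096
Subkey-Usage: encrypt
Name-Real: Digital Marketplace database backups
Expire-Date: 0
Passphrase: <store this SOPS-encrypted in the credentials repo>
%commit
EOF

# Export the public half (unencrypted) and the private half (to be
# SOPS-encrypted before it goes into the credentials repo)
gpg2 --armor --export "Digital Marketplace database backups" > public.asc
gpg2 --armor --export-secret-keys "Digital Marketplace database backups" > private.asc
```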
## Restoring from a backup
There is no automatic process to restore the production database from one of the dumps. If we're in a situation where that needs to happen, it's probably serious and the restore should probably be done manually. The steps will be similar to the below:
1. Alert the team on the `#dm-release` Slack channel, and grab the deploy gorilla.

2. Disable writes to the database, either by putting the site into maintenance mode (preferred option) or stopping the API app in the PaaS (only as a last resort). Consider disabling smoke/smoulder tests.

3. Ensure that you're logged in to Cloud Foundry and are in the production space (if that's where you're restoring to): `cf target -s production`

4. Follow the restore steps in the PaaS manual to create a new PostgreSQL service from a snapshot or point in time (depending on need). Give the new service a descriptive name like `restored-db`.

5. The restore will take at least 15 minutes to run. If you need a cup of tea, now is the time.

6. Update the `api` app to use the new service by changing the production app manifest variables. Note: you will need to copy the entire `services` block from `common.yml` to `api` to stop other service bindings being removed. Once this change is merged, re-release using the Jenkins job `rerelease-all-apps`.

7. Test that the data has restored correctly (https://dm-api-production.cloudapps.digital/_status should respond even during maintenance mode).

8. Let stakeholders know that the restore has been completed.

9. Ensure the team has a plan for reconciling any lost data, and for how this will be communicated to users.

10. Rename the existing `digitalmarketplace_api_db` service to something like `digitalmarketplace_api_db_old` (or just delete it altogether), and rename the `restored-db` service to `digitalmarketplace_api_db` (see the sketch after this list).

11. Revert the change to the production app manifest variables, and rerelease all apps as above.

12. Toggle maintenance mode to 'recovery' to restore access to the API apps only.

13. Re-sync the OpenSearch indices for services and briefs, using the Jenkins catchup jobs.

14. Toggle maintenance mode to 'live' to restore access to the frontend apps.

15. Re-enable smoke/smoulder tests (if disabled earlier). And relax.
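The service rename in step 10 maps onto standard Cloud Foundry commands, roughly as follows (a sketch; double-check service bindings before deleting anything):

```bash
# Keep the old service around under a new name rather than deleting it outright
cf rename-service digitalmarketplace_api_db digitalmarketplace_api_db_old

# Promote the restored service to the name the app manifests expect
cf rename-service restored-db digitalmarketplace_api_db

# Confirm the services and their bindings look right
cf services
```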