Infrastructure
Environments
We have stage and production deployments of Balrog. Here’s a quick summary:
| Environment | App | URL | Deploys | Purpose |
|---|---|---|---|---|
| Production | Admin API | | Manually, after someone clicks a button in Jenkins (details below) | Manage and serve production updates |
| | Admin UI | | | |
| | Public | | | |
| Stage | Admin API | | When version tags are created in Github | A place to submit staging Releases and verify new Balrog code with automation |
| | Admin UI | | | |
| | Public | | | |
Support & Escalation
RelEng is the first point of contact for issues. To contact them, follow the standard RelEng escalation path.
If RelEng is unable to correct the issue, or is unavailable, it can be escalated to the Services SRE (Purple) team.
Monitoring & Metrics
Metrics from deployment environments are available in Grafana and the GCP console.
We aggregate exceptions from both the Admin & Public apps to Sentry.
Application & HTTP Logs
Balrog publishes logs to BigQuery, where they can be queried on Google Cloud. The relevant tables are:
requests - This table contains HTTP load balancer logs
stdout - This table contains application logs sent to stdout
stderr - This table contains application logs sent to stderr
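As a rough illustration of querying these tables, the sketch below builds a Standard SQL statement for one of them. The project and dataset names are not given here, so they are passed in as placeholders; `build_log_query` is a hypothetical helper, and `SELECT *` is used only because the column layout is not described in this document.

```shell
# Hypothetical helper: print a BigQuery Standard SQL query for a Balrog log table.
# Only the table names (requests, stdout, stderr) come from this document;
# the project and dataset arguments are placeholders you must supply.
build_log_query() {
    local project="$1" dataset="$2" table="$3" limit="${4:-100}"
    case "$table" in
        requests|stdout|stderr) ;;
        *) echo "unknown table: $table" >&2; return 1 ;;
    esac
    printf 'SELECT * FROM `%s.%s.%s` LIMIT %s\n' "$project" "$dataset" "$table" "$limit"
}
```

The resulting string could then be passed to the `bq` command-line tool, for example: bq query --use_legacy_sql=false "$(build_log_query REPLACEME_PROJECT REPLACEME_DATASET stderr 20)"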
Backups
Balrog uses the built-in GCP backups. The database is snapshotted nightly, and incremental backups are done throughout the day. If necessary, we have the ability to recover to within a 5-minute window. Database restoration is done by the Services SRE (Purple) team, and they should be contacted immediately if a restore is needed.
Deploying Changes
Balrog’s stage and production infrastructure are managed by the Services SRE (Purple) team. Generally, Balrog is deployed on a regular two-week schedule: changes are staged on a Tuesday and deployed to production on a Thursday.
Is now a good time?
Although we deploy on a regular schedule, it is still important to make sure no urgent releases are ongoing before deploying. Post a message in the #releaseduty channel and wait for confirmation before proceeding with a production deploy.
Schema Upgrades
If you need to make a schema change, you must ensure that either the current production code can run with your schema change applied, or that your new code can run with the old schema. Code and schema changes cannot be applied at the same instant, so you must be able to support one of these scenarios. Generally, additive changes (column or table additions) should apply the schema change first, while destructive changes (column or table deletions) should apply the schema change second. You can simulate the upgrade with your local Docker containers to verify which is right for you. In staging and production, the schema upgrade is done automatically as part of the balrog-admin-production deployment.
A quick way to find out if you have a schema change is to diff the current tip of the main branch against the currently deployed tag, e.g.:
tag=REPLACEME
git diff "$tag"
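To narrow that diff to schema-related files, something like the following sketch may help. `changed_schema_files` is a hypothetical helper, and the default path is only an assumption about where the migration scripts live in the repository; pass the correct path as the third argument if it differs.

```shell
# List files that changed under a given path between the deployed tag and a ref.
# ASSUMPTION: the default path below is a guess at the migrations directory.
changed_schema_files() {
    local tag="$1" ref="${2:-main}" path="${3:-src/auslib/migrate}"
    git diff --name-only "$tag" "$ref" -- "$path"
}
```

Empty output from changed_schema_files "$tag" would suggest there is no schema change between the deployed tag and main, provided the path is right.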
When deploying a change with schema upgrades it is important to deploy the services in the correct order. Generally, this means that balrog-admin-production should be finished deploying before balrog-production for additive changes, and balrog-production should be finished deploying before balrog-admin-production for destructive changes.
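The ordering rule above can be written down as a small lookup. This is only a sketch: `deploy_order` is a hypothetical helper, and the labels "additive" and "destructive" are chosen here for illustration.

```shell
# Print the order in which to deploy the two backends for a schema change.
# First line of output = deploy (and let finish) first.
deploy_order() {
    case "$1" in
        additive)    printf '%s\n' balrog-admin-production balrog-production ;;
        destructive) printf '%s\n' balrog-production balrog-admin-production ;;
        *) echo "usage: deploy_order additive|destructive" >&2; return 1 ;;
    esac
}
```

For example, deploy_order additive lists balrog-admin-production first, because that deployment is the one that applies the schema upgrade.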
Deploying to Stage
To get the new code in stage you must create a new Release in Github as follows:
Tag the repository with a vX.Y tag, e.g.: git tag -s vX.Y && git push --tags
Diff against the previous release tag, e.g.: git diff v2.24 v2.25, to double-check whether there are schema changes. Look for anything unexpected.
Create a new Release on Github. This creates new Docker images tagged with your version, and deploys them to stage. It may take upwards of 30 minutes for the deployment to happen. Deployment notifications will show up in #balrog on Slack.
Finally, bump the in-repo version to the next available one to ensure the next push gets a new version.
Once the changes are deployed to stage, you should do some testing to make sure that the new features, fixes, etc. are working properly there. It’s a good idea to watch Sentry for new exceptions that may show up, and Grafana for any notable changes in the shape of the traffic.
Important Note! Only two-part version numbers (as shown above) are supported by our deployment pipeline.
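One way to guard against an unsupported tag is to check its shape before creating it. The sketch below assumes "two-part" means a `v` followed by two dot-separated numbers, as in the examples above; `is_two_part_tag` is a hypothetical helper, not part of the deployment pipeline.

```shell
# Succeed only for tags of the form vX.Y, where X and Y are numbers.
is_two_part_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+$'
}
```

It could be used as a precondition when tagging, e.g.: is_two_part_tag "$tag" && git tag -s "$tag"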
Pushing to Production
Pushing the backends live requires some button clicking in Jenkins. For each of balrog-admin-production, balrog-production, and balrog-agent-production in Jenkins do the following. (If there are no schema changes, these may be done in parallel. If there are schema changes, see Schema Upgrades):
Find the PROD: DEPLOY or PROD: PROCEED step.
Click the cell for this step in the topmost row. This should bring up a confirmation dialog as shown below.
Click Proceed.
After this, there is nothing else to do for balrog-admin-production or balrog-agent-production. However, the public app (balrog-production) will first deploy a canary (meaning the new code will only be used for a small fraction of requests).
Before proceeding, you should monitor for changes in load or exceptions for at least a few minutes. Specifically:
- Watch Sentry to see if any new exceptions show up for any of the backend services
- Watch the Grafana graphs for spikes or dips in any of the charts
If anything notable comes up you should seek an explanation for it before proceeding. If you are unable to explain the issue, consult with someone else and consider rolling back in the meantime.
When you are ready, find the PROD: PROMOTE cell in Jenkins and click Proceed to finish with this deployment.
To push a new UI to production, you must delete and recreate the “production-ui” tag and Release on Github:
On https://github.com/mozilla-releng/balrog/releases/tag/production-ui, click “Delete” (this deletes the Github Release).
On https://github.com/mozilla-releng/balrog/releases/tag/production-ui, click “Delete” (this deletes the Git tag, even though it’s the same URL).
On https://github.com/mozilla-releng/balrog/releases/new, create a new production-ui Release. This will trigger automation to deploy the new UI.
Rollbacks
To roll back the admin, public, and agent backends, do the following for each of balrog-admin-production, balrog-production, and balrog-agent-production in Jenkins:
Click “Build with Parameters” in the menu on the left.
Put the version you want to redeploy in the ImageTag field. This should be in the form of vX.Y, e.g.: v3.20.
Click Build.
As in this screenshot:
This will begin a deployment as described above. See the Pushing to Production section above for how to proceed with the production deployment from here.
If the UI needs a rollback, first delete the existing production-ui release and tag as described above, then update the “production-ui” tag to point to the earlier version. For example, to point to v3.08:
git tag -d production-ui
git tag -s production-ui v3.08^{}
git push origin production-ui