Infrastructure
Environments
We have stage and production deployments of Balrog. Here’s a quick summary:
| Environment | App | URL | Deploys | Purpose |
|---|---|---|---|---|
| Production | Admin API | | Manually by CloudOps | Manage and serve production updates |
| | Admin UI | | | |
| | Public | | | |
| Stage | Admin API | | When version tags are created in Github | A place to submit staging Releases and verify new Balrog code with automation |
| | Admin UI | | | |
| | Public | | | |
Support & Escalation
RelEng is the first point of contact for issues. To contact them, follow the standard RelEng escalation path.
If RelEng is unable to correct the issue, they may escalate to CloudOps.
Monitoring & Metrics
Metrics from deployment environments are available in Grafana and the GCP console.
We aggregate exceptions from both the Admin & Public apps to Sentry.
ELB Logs
Balrog publishes logs to S3 buckets which are available for querying on Redash. The relevant tables are:
balrog_elb_logs_aus{3,4,5} - These tables contain update request records sourced from the ELB logs of the named domain (eg: aus5). If you’re looking to do ad-hoc queries of update requests (eg: to estimate how many users are on a particular version or channel), balrog_elb_logs_aus5 is probably the table you want to query.
balrog_elb_logs_aus_api - This table contains request logs for the aus-api.mozilla.org domain.
log_balrog_admin_nginx_access - This table contains access logs for the admin app sourced from nginx access logs.
log_balrog_admin_nginx_error - This table contains error logs for the admin app sourced from nginx error logs.
log_balrog_admin_syslog_admin_fixed - This table contains syslog output from the admin app’s Docker container.
log_balrog_admin_syslog_agent - This table contains syslog output from the agent’s Docker container.
log_balrog_web_syslog_web_fixed - This table contains syslog output from the public app’s Docker containers.
Redash should show you the table schemas in the pane on the left. If not, you can inspect them with “describe $table”.
Backups
Balrog uses the built-in RDS backups. The database is snapshotted nightly, and incremental backups are done throughout the day. If necessary, we can recover to within a 5 minute window. Database restoration is done by CloudOps, and they should be contacted immediately if needed.
Deploying Changes
Balrog’s stage and production infrastructure are managed by CloudOps.
This section describes how to go from a reviewed patch to deploying it in production.
Is now a good time?
Before you deploy, consider whether it’s an appropriate time to do so. Some factors to consider:
Are we in the middle of an important release such as a chemspill? If so, it’s probably not a good time to deploy.
Is it Friday? You probably don’t want to deploy on a Friday except in extreme circumstances.
Do you have enough time to safely do a push? Most pushes take at most 30 minutes to complete once the production push has begun.
Schema Upgrades
If you need to do a schema change you must ensure that either the current production code can run with your schema change applied, or that your new code can run with the old schema. Code and schema changes cannot be done at the same instant, so you must be able to support one of these scenarios. Generally, additive changes (column or table additions) should do the schema change first, while destructive changes (column or table deletions) should do the schema change second. You can simulate the upgrade with your local Docker containers to verify which is right for you. In staging and production, the schema upgrade is done automatically as part of the balrog-admin deployment.
A quick way to find out if you have a schema change is to diff the current tip of the main branch against the currently deployed tag, eg:
tag=REPLACEME
git diff $tag
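When the full diff is large, it can help to look at the changed file names first. The sketch below is a hypothetical helper, not part of Balrog's tooling: `filter_paths` is an invented name, and the pattern in the usage comment is only a guess at how schema-related files might be named, so adjust it to the actual repository layout.

```shell
# Keep only the paths that match an extended regular expression.
# Paths are read from stdin, one per line.
# Returns success even when nothing matches, so it is safe under `set -e`.
filter_paths() {
    grep -E -- "$1" || true
}

# Usage, from a Balrog checkout (the pattern is an assumption):
#   tag=REPLACEME
#   git diff --name-only "$tag" | filter_paths 'migrat|schema'
```

An empty result only means that no changed path matched the pattern; it is not proof that the schema is untouched, so the full diff is still worth a skim.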
When you file the deployment bug (see below), include a note about the schema change in it. Something like:
This push requires a schema change, so admin should be deployed first to do the migration.
Bug 1772799 is an example of a push with a schema change.
Deploying to Stage
To get the new code in stage you must create a new Release in Github as follows:
Tag the repository with a vX.Y tag. Eg: git tag -s vX.Y && git push --tags
Diff against the previous release tag (eg: git diff v2.24 v2.25) to double-check whether there are any schema changes.
Look for anything unexpected.
Create a new Release on Github. This creates new Docker images tagged with your version, and deploys them to stage. It may take upwards of 30 minutes for the deployment to happen. Deployment notifications will show up in #balrog on Slack.
Once the changes are deployed to stage, you should do some testing to make sure that the new features, fixes, etc. are working properly there. It’s a good idea to watch Sentry for new exceptions that may show up, and Grafana for any notable changes in the shape of the traffic.
Important Note! Only two-part version numbers (as shown above) are supported by our deployment pipeline.
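Since a malformed tag will not deploy, a quick local check before tagging may save a round trip. The following is a minimal sketch that assumes "two-part" means a leading v followed by two dot-separated numeric components; the exact pattern the deployment pipeline accepts is not documented here, and `is_two_part_version` is a hypothetical helper name.

```shell
# Succeeds only for tags of the form vX.Y where X and Y are digits,
# for example v3.08. Three-part tags such as v3.08.1 are rejected.
is_two_part_version() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+$'
}

# Usage:
#   is_two_part_version v3.08 && git tag -s v3.08
```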
Pushing to Production
Pushing live requires CloudOps. For non-urgent pushes, you should begin this procedure a few hours in advance to give CloudOps time to notice and respond. For urgent pushes, file the bug immediately and escalate if no action is taken quickly. Either way, you must follow this procedure to push:
File a bug to have the new version pushed to production
Make sure you substitute the version number and choose the correct options from the bug template.
Before SRE starts the deploy, notify #releaseduty:mozilla.org and #sheriffs:mozilla.org on Matrix so they can both confirm that no release activity is ongoing, and know to quickly escalate any fallout.
Once the push has happened, verify that the code was pushed to production by checking the __version__ endpoints on the Admin and Public apps.
Manually delete and recreate the “production-ui” tag & release on Github to push the new UI to production:
On https://github.com/mozilla-releng/balrog/releases/tag/production-ui, click “Delete” (this deletes the Github Release).
On https://github.com/mozilla-releng/balrog/releases/tag/production-ui, click “Delete” (this deletes the Git tag, even though it’s the same URL).
On https://github.com/mozilla-releng/balrog/releases/new, create a new production-ui Release. This will trigger automation to deploy the new UI.
Bump the in-repo version to the next available one to ensure the next push gets a new version.
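The `__version__` check in the steps above can be scripted roughly as follows. This is a sketch under stated assumptions: it needs curl, the helper names are hypothetical, and the hosts in the usage comment are placeholders rather than the real Admin and Public URLs. It prints the raw response without parsing it, because the response format is not described in this document.

```shell
# Build the __version__ URL for a base URL (a trailing slash is tolerated).
version_url() {
    printf '%s/__version__\n' "${1%/}"
}

# Fetch and print the __version__ response for each base URL given.
# curl -f turns HTTP errors into a failing exit status;
# -sS hides the progress meter but still reports errors.
show_versions() {
    for base in "$@"; do
        url=$(version_url "$base")
        printf '%s: ' "$url"
        curl -fsS "$url" || return 1
        printf '\n'
    done
}

# Usage (placeholder hosts, substitute the real base URLs):
#   show_versions https://admin.example.invalid https://public.example.invalid
```

Compare the printed output by eye against the version that was just pushed.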
Rollbacks
If something goes wrong, CloudOps can roll back to an earlier version on request.
If the UI needs a rollback, after deleting the previous production-ui release and tag as above, update the “production-ui” tag to point to the earlier version. Something like (to point to v3.08):
git tag -d production-ui
git tag -s production-ui v3.08^{}
git push origin production-ui
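After recreating the tag, it may be worth confirming that it resolves to the intended commit. The sketch below uses a hypothetical helper, `same_commit`, built on `git rev-parse`. It only inspects the local repository, so it says nothing about what has been pushed to the remote.

```shell
# Succeeds when both refs resolve to the same commit.
# "^{commit}" peels annotated or signed tags down to the commit they point at.
same_commit() {
    left=$(git rev-parse --verify --quiet "$1^{commit}") || return 1
    right=$(git rev-parse --verify --quiet "$2^{commit}") || return 1
    [ "$left" = "$right" ]
}

# Usage, matching the v3.08 example above:
#   same_commit production-ui v3.08 && echo "production-ui points at v3.08"
```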