The reason we do this is that nobody wants to work on a Friday evening. Deploying on Fridays keeps us motivated to make sure our system is ready to be deployed at all times, and that is important to us. Having a system that can be deployed at any time allows us to quickly push out new features as well as fix incorrect behaviour when needed.

We follow a few practices that I believe are key to keeping a system deployable. These have worked great for us, and if you are pursuing a situation where you can deploy your system at any time, I think they are a good place to start.

Excellent test coverage

We have a lot of tests, in all flavours: unit tests, service-level tests, integration tests and end-to-end tests. All in all, there are about 35 000 tests (back-end and front-end) for one of the applications (the application is roughly 160 KLOC).

By using Test-Driven Development (TDD) we get a large number of tests and high test coverage. We do not measure test coverage, and we see no reason to start, as it says nothing about the quality of the tests. The quality of the tests is maintained by following the red-green-refactor cycle in TDD, where each test starts by failing with a clear, descriptive failure reason. We also review all test code as thoroughly as any other code.

Having a large test suite is a key requirement for being able to release with high confidence that nothing is broken. When I make a change to a part of the system I am not too familiar with, I often start by introducing something that will break it to see which tests fail. That helps me understand what new tests might be necessary, as well as how much manual testing I need to do. Even though there are lots of tests, you as a developer still need to exercise your system manually, just not as much.

Also, coverage alone is not sufficient. Using descriptive and clear assertions can really speed up the understanding of why something failed. This is, of course, always handy, but especially so in those situations where a bug slipped into production and you need to make a fix for it fast.
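To make the point concrete, here is a minimal sketch of the difference (the Invoice type and the NUnit style are assumptions for illustration, not our actual code):

```csharp
using System;
using NUnit.Framework;

// Hypothetical domain type, only for illustration.
public record Invoice(DateTime DueDate)
{
    public bool IsOverdue(DateTime today) => today > DueDate;
}

[TestFixture]
public class InvoiceTests
{
    [Test]
    public void Overdue_invoice_is_flagged()
    {
        var invoice = new Invoice(new DateTime(2020, 1, 1));
        var today = new DateTime(2024, 1, 1);

        // A bare Assert.IsTrue(...) would only report "Expected: True, But was: False".
        // A descriptive assertion explains what failed and with which data.
        Assert.That(
            invoice.IsOverdue(today),
            Is.True,
            $"Invoice due {invoice.DueDate:yyyy-MM-dd} should be overdue on {today:yyyy-MM-dd}");
    }
}
```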

Backwards compatibility

All changes we make must be backwards compatible. One aspect is that we want to avoid having to coordinate changes with other software (e.g. between our back-end API and our mobile application) as well as with software run by other teams. The latter is even harder as it involves people! Releasing changes independently of each other is a must if you want to release often, which we do.

Another aspect is data compatibility. Our application is fairly old, with stored data that has not been updated in years, and a user must always be able to open and work on their old data. When introducing a change or new behaviour we always make it backwards compatible.

A pattern we use whenever possible is to transform data to the new structure when it is loaded. This helps limit the compatibility concerns to the storage-related parts of the code. Testing backwards compatibility can be tricky, though. We have added functionality to bypass parts of the system and insert data in a serialised format, so that tests can inject “old” data without us having to keep an old database around.
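As a rough sketch of the transform-on-load idea (the Customer model, the field names and the use of System.Text.Json are assumptions, not our actual code), the upgrade happens in the storage code and nowhere else:

```csharp
using System.Text.Json;
using System.Text.Json.Nodes;

// Hypothetical current model: the address was split into separate fields,
// but documents stored years ago still contain a single "address" string.
public record Customer(string Name, string Street, string City);

public static class CustomerStore
{
    // Transform-on-load: the rest of the code base only ever sees the current
    // Customer shape; backwards compatibility lives here, in the storage code.
    public static Customer Load(string json)
    {
        var node = JsonNode.Parse(json)!.AsObject();

        if (node["address"] is JsonNode oldAddress)
        {
            // Old format: "address" held "Street, City". Upgrade it in place.
            var parts = oldAddress.GetValue<string>().Split(", ", 2);
            node.Remove("address");
            node["street"] = parts[0];
            node["city"] = parts.Length > 1 ? parts[1] : "";
        }

        return node.Deserialize<Customer>(
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;
    }
}
```

A test can then feed the storage entry point a raw serialised document in the old format, which is exactly the kind of bypass mentioned above: the “old” data is injected as a string, with no old database kept around.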

Forward fixing

We never roll back a system to a previous version. Instead, when there is a critical issue, we always make a fix or workaround and make a new release. Managing roll-backs is a lot harder than going forward. For example, if your new version introduced a change to persisted data, the old version must be able to read that data without breaking. Likewise, if your service exposes an API and there is a change in the protocol, your clients must be able to handle going back to the previous version as well.

From previous experience, I find it very easy to end up in a situation where you either cannot roll back, or you make a roll-back only to realize later that you forgot a dependency and now have two fires to put out.

By always going forward, a whole category of problems goes away.

Root cause analysis

We all know that things go wrong every now and then. It is never a single person’s fault, even though it can sometimes feel like that when you have introduced an error. When there is a failure, it is the team’s responsibility. All changes require collaboration from your colleagues, at minimum a code review. It is crucial to provide a safe environment where there is no blame when things go sideways; instead, we all focus on how to avoid the same situation again.

Whenever an issue is found in the system, a new test is added to ensure that it will not resurface. We also take a step back and try to understand why the error occurred. What action can be taken to avoid this issue in the future? How can we make our domain model more rigid, so that it effectively blocks this from happening again? Was the missing test an unfortunate mistake, or are we missing a whole category of tests for similar situations?

One example of using software design to avoid errors is that our HTTP controllers must either perform an authorization check or explicitly disable it, otherwise the request will be failed by a middleware. This makes it very hard for us to miss adding authorization in the HTTP layer.
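The details of our middleware are not important, but as a minimal sketch of the idea in ASP.NET Core terms (the middleware name and the choice to key off endpoint metadata are assumptions for illustration), a request whose endpoint neither declares authorization nor explicitly opts out is rejected:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Http;

public sealed class RequireExplicitAuthorizationMiddleware
{
    private readonly RequestDelegate _next;

    public RequireExplicitAuthorizationMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var endpoint = context.GetEndpoint();

        if (endpoint is not null)
        {
            // The endpoint must either declare authorization ([Authorize] or a policy)...
            var declaresAuthorization = endpoint.Metadata.GetMetadata<IAuthorizeData>() is not null;
            // ...or explicitly opt out of it ([AllowAnonymous] or similar).
            var optsOut = endpoint.Metadata.GetMetadata<IAllowAnonymous>() is not null;

            if (!declaresAuthorization && !optsOut)
            {
                // Forgetting authorization is treated as a bug, not as a default.
                context.Response.StatusCode = StatusCodes.Status403Forbidden;
                await context.Response.WriteAsync(
                    "Endpoint declares neither an authorization check nor an explicit opt-out.");
                return;
            }
        }

        await _next(context);
    }
}

// Registered after routing, so that endpoint metadata is available:
// app.UseRouting();
// app.UseMiddleware<RequireExplicitAuthorizationMiddleware>();
// app.UseAuthorization();
```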

Another example is using the test suite to rule out future errors based on the type system. We use a C# attribute for classes that are serialised and deserialised. These classes are identified by a test suite using reflection, and the test ensures that every such class can be serialised and then deserialised. When you add a new such class, it is included in this test automatically. This has helped us identify early on when a property is not serialised correctly, instead of becoming aware of it much later in the process, when it is harder to figure out why something is missing.
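As a sketch of how such a test suite can look (the [Serialised] marker attribute and the use of System.Text.Json and NUnit are assumptions; a real suite would also build representative instances rather than rely on parameterless constructors):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text.Json;
using NUnit.Framework;

// Hypothetical marker attribute for classes that are serialised and deserialised.
[AttributeUsage(AttributeTargets.Class)]
public sealed class SerialisedAttribute : Attribute { }

[TestFixture]
public class SerialisationRoundTripTests
{
    // Every class carrying the attribute is found via reflection, so a newly
    // added class is included in the test automatically.
    private static IEnumerable<Type> SerialisedTypes() =>
        typeof(SerialisedAttribute).Assembly
            .GetTypes()
            .Where(t => t.GetCustomAttribute<SerialisedAttribute>() is not null);

    [TestCaseSource(nameof(SerialisedTypes))]
    public void Round_trips_through_serialisation(Type type)
    {
        // Simplification: assumes a public parameterless constructor.
        var original = Activator.CreateInstance(type);

        var json = JsonSerializer.Serialize(original, type);
        var copy = JsonSerializer.Deserialize(json, type);

        Assert.That(JsonSerializer.Serialize(copy, type), Is.EqualTo(json),
            $"{type.Name} did not survive a serialise/deserialise round trip");
    }
}
```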

Feature toggles

When new features are developed, they are often hidden behind a feature toggle. This way we can keep our branches and pull requests short-lived and continuously merge new features into our main line. We never want to hold back a release because of a new feature. That would only accumulate risk, and before you know it you start doubting whether it is risk free to make a deploy.

In our system, feature toggles are well integrated into the authorization mechanism. We can assign a feature to an individual, a role or a tenant. We use this to give early access to internal or friendly users (our internal users are also friendly, you know what I mean!) so they can provide us with feedback on how well our solution fits their problem.
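A minimal sketch of that idea (all names here are hypothetical, and the real integration with our authorization mechanism is of course richer) is to model toggles as grants that can target a user, a role or a tenant:

```csharp
using System;
using System.Collections.Generic;

public record User(Guid Id, Guid TenantId, IReadOnlySet<string> Roles);

// A feature grant targets exactly one of: a user, a role or a tenant.
public record FeatureGrant(string Feature, Guid? UserId = null, string? Role = null, Guid? TenantId = null);

public class FeatureToggles
{
    private readonly List<FeatureGrant> _grants = new();

    public void Grant(FeatureGrant grant) => _grants.Add(grant);

    // Enabled if any grant for the feature matches the user, one of their
    // roles, or their tenant.
    public bool IsEnabled(string feature, User user) =>
        _grants.Exists(g => g.Feature == feature &&
            (g.UserId == user.Id ||
             (g.Role is not null && user.Roles.Contains(g.Role)) ||
             g.TenantId == user.TenantId));
}

// Early access for one internal user, and for everyone in a pilot tenant:
// toggles.Grant(new FeatureGrant("new-report", UserId: internalUser.Id));
// toggles.Grant(new FeatureGrant("new-report", TenantId: pilotTenantId));
// if (toggles.IsEnabled("new-report", currentUser)) { /* show the feature */ }
```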

A few weeks ago, we gave an internal user access to a new feature we had built. Only one or two days later they reported back that it was broken! It turned out that the input data they used triggered a built-in size limitation that the new feature was not adapted for. We had obviously missed a boundary test for this feature, so adding one was the first thing we did. Luckily for us, this was found by an internal user and corrected before the feature reached general availability.

Deploy often

We do not only deploy on Fridays; we deploy to production several times a week. We have automated the process, so all you have to do is request a deploy (using GitHub Actions) and have one of your teammates approve it. After that the deploy pipeline takes over: it runs all tests, generates the changelog, tags the release version and pushes to production.

The deploy request lists the changes made in a changelog style, and all developers on the team can approve it. You might find it scary that anyone can approve a release request. Adding too much process here would both slow things down and introduce a threshold for making releases. The people who can approve are the same people who produce code and configuration changes via pull requests; since we trust them with that, we trust them to decide whether or not something should be pushed to production.

I really encourage you to adopt the tradition of deploying on Fridays! Once your organisation has this level of confidence in your system being functional, and in your team’s ability to swiftly resolve any issues, you will have truly accelerated your ability to provide customer value.

If it hurts, do it more often.

/Markus Eliasson, Consultant