Version Driven Branching (part 1)

We recently reached a consensus in my team. The idea was that we need to start prioritising the production environment. This meant that we would have to get used to diagnosing problems there, instead of spending our time building expertise in debugging the software on our local machines and other environments that were not directly mimicking the production environment.

The thought process was largely driven by Cindy Sridharan's article on Medium, which largely suggested dropping Pre-production testing and investing in Post-production testing. One of the main points that intrigued us was that this solution would mean that we wouldn't require a build and deployment industry, which we have had in other projects. Don't get me wrong, there are many cool things you can do with a team dedicated to build and deployment, but it will mean that your company will inadvertently end up spending a lot of money supporting a team that is not working on your business domain. We made a bit of compromise with the article. We would have 2 sets of environments - "Test" and "Production", each with 2 slots (+1 for staging and -1 for rollback).

Tests ran in our Production staging environment (Prod +1) would let us check whether the code would work. If the tests passed, the slots would be switched around and the new code would be running in production. The Test environment, which was to be an exact replica of the Production environment - slots and all - was there to ensure that the deployment mechanism worked. Slot swapping could be tested in the Test environment before we subjected the Production environment to the same process.

Next we started talking about git branching strategies, and suddenly we were no longer agreeing.

Gitflow

A colleague had recently read about Gitflow and was very keen on adopting this strategy. Many projects in TfL use a similar branching strategy (albeit using Team Foundation Server), so it was not so much a new concept to them and we know it definitely works to an extent. In Gitflow, you branch from the Master branch to a Develop branch, then once you are happy with a particular version of the Develop branch, you promote it to a Release branch, which can then be merged back into the Master branch. The Master branch should contain the exact code that is on the production system, which should help with diagnosing issues in production due to the fact you know exactly what is there and you can perform hot fixes on the Master branch while the Develop and Release branches may contain changes you do not want to adopt yet.

We considered allowing our automated deployments to move the code to varying parts of our environment. Any branch could be tested on Test +1, the Develop branch would be tested on Test +1 then on Test. Release branches would go through the Test environments and then reach the Prod +1 stage and the Master branch would go through all the environments up to Production.

Most tests would most likely be running on the Test environment, as the Develop branch is most likely the branch that most code changes would go to and release branches would most likely be created quite rarely. Our aim of trying to get good at testing our code on Production (or at least as close to as possible) was failing, since most of our testing would happen on a different environment.

Removing Release branches and merging code straight from the Develop branch to the Master branch was proposed as a solution. We could allow the Develop branch to now go as far as Prod +1, meaning the tests would be run very often, very close to the actual Production system (although never on the production system). The issue with this idea is that at the point that the slots are changed between Prod +1 and Prod, one cannot be certain that it was definitely from the Master branch. If the Master branch was moving through its deployment steps and had reached Prod +1 but was still waiting before it switched slots, if someone checked something into develop, it could replace the Master branch code in Prod +1, and when the slots were switched, there would be code from the Develop branch in production. This issue could also occur in the case where Release branches are used - Release branch code could similarly end up in Production if someone had wanted to deploy it to do some testing while someone else was busy trying to get the Master branch into Production.

The easiest way to solve the problem would be via process - enforcing branching freezes while Master was on the deployment journey would definitely ensure that this issue couldn't occur.

Backing away from the solution in a way to try to avoid this problem, we tried to consider what environments we would need to be able to keep Release environments and not let code get accidentally dropped into Production.

Of course, this means that an additional environment is needed for the Release branch. It looks like one environment for now, but at the moment that there are multiple releases queued up, this will no doubt result in multiple environments, one for each release [1]. That would be the only way to ensure the quality of the code in each branch, in case something goes wrong during the wait till final deployment.

Physical deployment constraints

Having a look at the process of the code's journey through all the environments, one can get a good idea of the (best case) time it would take for code to get from a developer's machine to Production. If tests take time t to run and are run on each environment (and +1 slot) and each branch goes to an additional environment while still going through the previous environments, the Release branch with its dedicated environment has added 6t, which is double the best-case time before the Release branch was added. If these tests take just 30 minutes to run, this would result in a situation where code cannot possibly reach production in less than 6 hours, while the release environment itself doesn't seem to add any value, since the branched code had to go through the Test environment as well.

The alternative to this would be to allow a branch's code to only run on its own environment - Develop would run on Test, Release on the Release environment and Master on the Production environment. At this point, we no longer have the initial intended check of being able to check the current code whether it could go through the slot switch process in Production via a seperate complete Environment (as in the Test environment as first proposed) unless we are sure the merge from Release to master branch was definitely correct and we have had a branching freeze [2].

Alternatives to Gitflow

Interestingly, atlassian (who hosted the Gitflow article previously mentioned) also has articles on other branching strategies that can be used in projects. The Feature workflow ultimately results in an environment per feature, while the Forking workflow will result in an environment per developer. Both of these alternatives mean that we will not avoid an industry of build and deployment. On the other hand, the Centralised workflow naturally fits a scenario with few environments. It would have the benefit in that code would only need to go through deployments to each environment only once and the Master branch would consistently be updated with the latest code and make it to Production fast. It would be easy to set up a policy to ensure that slots are only switched between Prod and Prod +1 if Prod +1 goes through some additional sanity tests. This would mean that you may need branching freezes as mentioned before, but it definitely doesn't require the spinning up of any additional environments.

Versioning Culture

The greatest outcry against the idea of Centralised workflow is that it may become hard to manage versioning of the code. What do you do when something goes wrong in Production? How do you hotfix Production, when the branch has already progressed so far since then? It is no doubt possible to have a process where a branch's version is easily associated with the Production code - it could be branched from that point and hotfixed. However, the concept of a Centralised workflow is better suited to a situation where code is released fast into Production - each check in can end up in Production if all the tests are running green in all the deployment steps in its journey. If the tests are going green, why shouldn't the code be in production? Additional tests can be run Post-production, as suggested in the original Medium article and we have reached that solution with the possibility of keeping the additional compromise-environments that give us more assurance on the final deployment steps. Of course, a thing that is necessary for such a process to work is incredible trust in the developers on the project. You cannot ensure that a developer doesn't remove or ignore valuable tests, you can only leave that to trust. Enforced peer reviews can help with this trust - dodgy code would have had to make it through automated tests as well as the steely gaze of at least two professional code-ninjas.

In terms of version management itself, most businesses are not particularly interested in version numbers. In fact the only time they ask about version numbers is when they are asking about what features are present in a version number. The only versioning information that developers should have to pass on to businesses/clients is the time that a feature is added. Features should be added to systems in a way that allows them to be loosely coupled and for the features to go through canary testing before they go live properly. This means that most features will originally have to be developed with feature switches that are initially turned off and are only turned on when the entire system is functionally capable of supporting that feature. The version number/id of the branch starts to become irrelevant to the business, since there may be multiple features that are already in Production, but are not necessarily turned on. In the meantime, tests can be set up to turn features on, test them, then turn them off again (and these tests don't necessarily have to run in Production). In this way you can test multiple features being added to a codebase, while avoiding the need to add additional environments per feature, or per release or per developer.

Branch sanity

As in most cases in software development, there are trade-offs. More dedicated branches means we can add more controls to the code, but it also means you require more environments to ensure that branch is in a fit state. More environments and branches results in more time diagnosing issues on systems that are not your production system, although they DO mean that you get more assurance your code will work, if you are keeping them in a constant green state. The right balance between control of your code and the management cost and time required to keep that solution sustainable are a choice that need to be made within your team to fit the trust you have in your code and process and undoubtedly there is no right or wrong answer.

This article is part 1 of a two-part series. Click here to read on.

Clarifications

1. I may be extrapolating a bit here, but it is easy to imagine the scenario - Release 42 is ready, but suddenly there is an issue in the Production code. While hotfixes are being found and applied, someone in another team has managed to merge enough code to create Release 43. If a version manager insists that something in Release 42 needs to be out in the open before Release 43 and Release 43 branch has already been created and therefore has overriden the Release 42 code in environment x, the environment would have to be loaded up with Release 42 code again. At this point someone will suggest that maybe it might be worthwhile having a seperate environment for Release 42 and both branches could be monitored and could have the same hotfixes that the Production code needs applied to them both (or it could be shown that a Release Y may not need the hotfix). Ultimately, the point is that if code does not make it to the Master branch fast enough, Release branches will stack up and the only way to ensure the quality of each branch, it would make sense to have an environment per branch that may soon need to be released. If your code moves fast and you merge and push code often you shouldn't have this issue, but the additional environment and the fact that more branches means that there are more merges and pushes required already puts physical limits to how fast you can get your code to the Master branch.

2. The initial plan was to have Test +1 and Test to try to mimic the switch between Prod +1 and Prod to make sure the switching process will work. If the Master branch is only deployed to Prod +1, tested and then switched into Production, we cannot be sure that this process will work if the branch the code had been merged from had already completed this process on a replicated environment. This is due to the fact that the Merging process itself can result in errors. For example, if a hotfix had been applied to the Master branch and then also added to a Release branch, at the point that the Release branch is merged into Master, a conflict could have been poorly resolved due to human error. I may be splitting hairs over a rare and unlikely scenario, but isn't that part of the fun?

published: Wed Mar 14 2018

New book!

My new book The Perfect Transport: and the science of why you can't have it is now on sale on amazon, or can be ordered at your local bookstore.

Michal reads books, solves equations and plays instruments whenever he isn't developing software for Lowrance, B&G, Simrad and C-MAP. His previous work at TfL has left a lingering love for transport.

Michal Paszkiewicz