June 24, 2024
Batching database changes for CI/CD pipelines
This is one section of our complete 12-part guide to CI/CD for databases - get your copy.
In the previous section, we examined the problems associated with bringing database changes into a CI/CD flow and broke them into manageable and, hopefully, solvable chunks. Now we begin to address those chunks one at a time.
With that in mind, the first thing we have to do is get our inbound changes into some sort of structure that we can handle. If we cannot get this first piece organized, we will fall victim to the lesser-discussed truth of computing — “chaos in, chaos out”. (Closely related to “garbage in, garbage out” — but with more downtime.)
Changes and batches
First, we have to understand what a ‘change’ is and, from there, what a ‘batch’ is. For the purposes of a database, a change is simply a command that causes the contents or structure of the database to be different after the command is executed. So, an ‘ALTER’ would typically be a change, but a ‘SELECT’ would not. (Well... unless you added an INTO, anyway.) A single change is an easy enough concept. Where Agile and DevOps practices like CI/CD get interesting is with the notion of “batch size”.
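To make the distinction concrete, here is a minimal SQL illustration (the table and column names are invented, and SELECT ... INTO syntax varies by database):

```sql
-- A change: the table's structure differs after execution.
ALTER TABLE customer ADD COLUMN email VARCHAR(255);

-- Not a change: the database is identical before and after.
SELECT id, name FROM customer;

-- A change after all: SELECT ... INTO creates a new table.
SELECT id, name INTO customer_archive FROM customer;
```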
The term “batch” comes from classic manufacturing. The idea is that you want to produce a large number of identical items. To do so at scale, you set up a machine to do the job; the machine takes identical raw material in and churns identical finished pieces out. You do this in ‘batches’ because setting the machine up is costly and time-consuming. This approach is economical when all the inputs and outputs are the same — whether they are cookies or car parts. With software and databases, you are never making the exact same change twice. Therefore, every change is effectively a unique batch of one.
SQL scripts are NOT automatically batches
The idea that every change is effectively a unique batch of one usually stimulates a conversation. Changes are typically packaged and tracked as SQL files, but a single file can contain any number of statements. The nature of those statements also alters how the file is perceived.
For example, suppose you perform 10 INSERTs into a table in a single SQL file.
- Is that one change or one batch of 10 changes?
- What if they were 10 ALTER statements?
- What if they were 10 ALTER statements, but three were for one table, five were for a second table and the remaining two were for a third table?
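To make the ambiguity concrete, here is a hypothetical script matching that last scenario (all object names are invented for illustration):

```sql
-- changes.sql: ten ALTER statements in one file, spread across three tables
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP;    -- orders: 1 of 3
ALTER TABLE orders ADD COLUMN carrier VARCHAR(50);     -- orders: 2 of 3
ALTER TABLE orders DROP COLUMN legacy_flag;            -- orders: 3 of 3
ALTER TABLE customers ADD COLUMN email VARCHAR(255);   -- customers: 1 of 5
-- ...four more ALTERs on customers omitted for brevity...
ALTER TABLE invoices ADD COLUMN due_date DATE;         -- invoices: 1 of 2
ALTER TABLE invoices ADD COLUMN paid BOOLEAN;          -- invoices: 2 of 2
```

Is this one change, ten changes, or three (one per table)? Reasonable people answer differently, and that is exactly the problem.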
The problem compounds when multiple SQL files exist, each containing a number of potentially unrelated changes.
These scenarios illustrate a common problem:
“We just track and review SQL scripts and then someone runs them against the database.”
This approach is so wildly variable that it becomes very difficult and time-consuming to manage. This is why database change management tools such as Liquibase are precise in defining the boundaries of any given change they track.
Explicitly defining what a “single change” means
Given the variable nature of database change types and the subjective nature of what a ‘single change’ is, the first step is to establish rules for what constitutes a single change.
For example, you might declare something like the following:
- Each DDL operation is a change.
- If multiple DDL operations on the same table are desired, each of them is still an individual change.
- Operations on stored logic objects, such as stored procedures, functions, etc., are represented as one stored logic object per script (similar to the best practice of ‘only one class per .java file’). Each stored logic script is a change.
- Multiple DML operations on the same table AND in the same SQL file can be considered a single change.
- Once a change has been applied to any database (including non-production databases) as part of a batch, it cannot be modified. It must be replaced or corrected with another change in a subsequent batch.
And so on…
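As a sketch of rules like these in action, here is a hypothetical Liquibase-formatted SQL changelog (the author names and identifiers are invented). Each changeset is one tracked change: each DDL operation gets its own changeset even when it touches the same table, while related DML on one table is grouped as a single change:

```sql
--liquibase formatted sql

--changeset dana:create-customer
-- One DDL operation = one change.
CREATE TABLE customer (id INT PRIMARY KEY, name VARCHAR(100));

--changeset dana:customer-add-email
-- A second DDL operation on the same table is still its own change.
ALTER TABLE customer ADD COLUMN email VARCHAR(255);

--changeset dana:seed-reference-customers
-- Multiple DML operations on the same table in one file = one change.
INSERT INTO customer (id, name) VALUES (1, 'ACME Corp');
INSERT INTO customer (id, name) VALUES (2, 'Globex Inc');
```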
Once you have established the definition of a ‘single change’, you must stringently enforce it — ideally by technological means. Get and use a code analysis tool. If something does not pass, the change contributor must refactor the submission — no exceptions.
That may sound harsh, but remember that these change scripts are intended to run against your Production database someday. So, why is it okay to let something slide?
Batches: Grouping database changes
It’s important to remember that because each specific change in software is effectively a one-off, the “ideal” batch size is exactly one. A group containing only one change can have no internal dependencies; it will either work or it will not.
If it does not work, it will be quickly apparent and it will be easy to find and fix because only one thing changed. This line of thinking underpins Continuous Integration, Continuous Delivery, and other DevOps practices.
As a practical matter, one-change batches are difficult to achieve in most organizations. There will need to be some compromises to the ideal — particularly in the early days of adopting the new practices. The key point is that smaller batches are better.
Building a batch of database changes
The application development world and common CI practices give us a clear pattern to use for defining batches. Every CI build, by definition, represents the production of a batch. Code changes come into a particular branch of the source code repository and a build engine picks them up, checks them over, builds them, runs some tests, and then, hopefully, delivers a compiled binary that includes the incremental new batch of changes.
To do this with database changes, we need to specify how we will handle certain things every time we create a batch — and do them automatically, just as we do for a CI build of application code.
- Collect changes
The most basic task is to decide how to collect the database changes. Database changes typically come in as code (.sql files), so they should be handled by a Source Code Management (SCM) system, such as Git. There are plenty of well-established patterns for merging changes in an SCM, and there is no need to reinvent them for database changes. (This may be obvious to some, but it bears repeating because there are many organizations where these files are processed as attachments to Change Management tickets by people who never use an SCM.)
Good practices are key for identifying new changes and ignoring old ones. SCM tools are great at facilitating this, but unlike application code, where the same files evolve over time through many builds, most database changes cannot be rerun. Tracking this is a central feature of a database change management tool such as Liquibase, which knows which changes have been run, which ones can be rerun if necessary, and which ones have not yet been run.
- Inspect changes
Each inbound change must be verified for compliance with the agreed rules defining a ‘single change.’ The CI process is an ideal place to automate that scan, and it comes with an explicit exception-handling procedure — ‘breaking the build’ — to deal with problems.
- Organize changes into a batch
Once changes pass inspection, they can be merged together, organized into a batch, and tested. This is the heart of the CI process: inbound changes are assembled into a group, their sequence is finalized, and the change metadata is inspected for completeness. The Liquibase changelog file structure, for example, is designed to be a human-readable means of specifying all of these aspects before the changes get processed.
Regardless of the means used to define sequence and batch groupings, the important thing is that sequence and grouping are treated as being as intrinsic to the batch as the changes themselves. This is a crucial difference between database changes and application changes — database changes are cumulative.
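To illustrate how sequence can travel with the batch, here is a hypothetical Liquibase-formatted SQL changelog in which the third changeset depends on the first. Changesets execute in file order, so the ordering is part of the batch definition itself (names and identifiers are invented):

```sql
--liquibase formatted sql

--changeset alice:20240624-01-create-carriers
CREATE TABLE carriers (id INT PRIMARY KEY, name VARCHAR(50));

--changeset alice:20240624-02-orders-add-carrier
ALTER TABLE orders ADD COLUMN carrier_id INT;

--changeset bob:20240624-03-seed-carriers
-- Depends on changeset 01; file order guarantees it runs after.
INSERT INTO carriers (id, name) VALUES (1, 'ACME'), (2, 'Globex');
```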
- Create an immutable artifact
For applications, once you have created a ‘build’ of some code using a CI process, a binary or similar product exists. That binary represents the prior changes plus the new ones added as part of the build, and it is not going to change. In other words, the end product is fixed. To make a change, you would have to modify the source code, run a new build of the software, and get a new binary.
The CI process for the database changes also needs to produce an equivalent artifact. It can be as simple as putting the batch definition into a ZIP file and moving that file into some kind of repository, or it can be more complex. Whatever the containment solution, each batch’s contents must be known and remain unchanged once defined and verified by the CI automation.
Any change to something in a defined batch — even if it is just to override one thing — needs to go into a new batch. This way, it is easy to track which changes are present anywhere in the CI/CD pipeline by tracking the batches.
Expert insight:
Violating batch integrity should be viewed as a rare and serious exception. If a change works in one environment but not in a second, there is something wrong with one of the environments. Modifying the batch – say through a comparison or “diff” – may make the batch work in the second environment, but will not guarantee the batch will work in a third environment. With just a few environments in a chain that each require some kind of tweak to a batch, it is possible that wholly unplanned changes will be added and make it to Production. That can have quality, traceability, and security implications depending on the nature of the application or the business using it.
- Track changes
The point of being able to identify changes and batches individually is that they become trackable. You can track who created them, which batch a change belongs to, which databases a change has been applied to, and so on.
Precise tracking is important for people managing the CI/CD pipeline, but it is absolutely critical to the automation around the pipeline. This tracking data is what allows the automation to decide whether changes are in an environment or not, whether to run or rerun a change, whether there are any gaps in the prior change history, and so on.
In other words, without good tracking, automation cannot function.
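For a sense of what that tracking data looks like, Liquibase, for example, records each applied changeset in a DATABASECHANGELOG table inside the target database. A query along these lines (column list abbreviated) shows what has run, in what order, and how:

```sql
-- Inspect Liquibase's tracking table: which changes have been
-- applied to this database, in what order, and with what result.
SELECT id, author, filename, dateexecuted, orderexecuted, exectype
FROM databasechangelog
ORDER BY orderexecuted;
```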
We’ve covered how to define database changes and create batches in source control so that those changes are usable and trackable.
The next problem we’ll tackle in CI/CD for databases is how to wrap automation around these batches of database changes. As we have learned, CI provides guidance, but each phase of batch creation brings challenges for automation.
Ready to dive into all 12 parts of the complete CI/CD for databases guide? It covers:
- The goals & challenges of designing CI/CD pipelines
- Batching database changes
- Automating database “builds”
- Ensuring quality database changes
- Reliable and predictable database deployments
- Trustworthy database environments
- Sandbox databases
- Database governance for consistency, quality, and security
- Observability for database CI/CD
- Measuring database CI/CD pipeline performance
- Handling and avoiding problems with database changes
- Bring CI/CD to the database with Liquibase