August 6, 2020

Sandbox Databases

This is the eighth post in a series we’re calling CI/CD:DB. Read from the beginning.

To this point, we have largely focused on processing database changes once they have been created. For this post, we are going to go all the way back to the creation of the change itself. Not because we need to discuss the best generation tool, IDE, or text editor for creating SQL scripts, but because once we have an automated CI/CD pipeline, it puts pressure on the database change creators to get things right the first time.

The sandbox database concept

No one wants to be the person who “broke” the CI build or to cause a disruption in the flow of changes from other change creators. That means that we must equip them with as much tooling as possible (or at least practical) to enable them to make good decisions as they create the database changes. As with the overall CI/CD pipeline, this is not a revolutionary concept — it exists for a lot of other parts of an application stack and is generically called the developer’s “sandbox”. It is a place where a developer can safely experiment and understand the impacts of their changes on the system without disrupting the work of others. So, in that sense, a sandbox is just a copy of a ‘real’ pipeline database that runs on the same database engine version, contains an object structure that accurately reflects the pipeline (including stored logic), and contains just enough data to run tests to verify any new changes.
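To make that concrete, here is a minimal sketch of what a sandbox can look like in practice: a disposable container running the same engine version as the pipeline, loaded with the current object structure and a small test dataset. It assumes a PostgreSQL engine running under Docker with the psql client tools available locally; the image tag, port, credentials, and script names (schema.sql, seed_data.sql) are illustrative placeholders, not part of any particular pipeline.

```python
# A minimal sandbox sketch: a throwaway container on the same engine version
# as the pipeline, loaded with the schema (including stored logic) and just
# enough data to exercise new changes. All names below are placeholders.
import os
import subprocess
import time

ENGINE_IMAGE = "postgres:13.3"   # must match the pipeline's engine version
SANDBOX_NAME = "sandbox-db"
PG_ENV = {**os.environ, "PGPASSWORD": "sandbox"}   # throwaway local credential

def create_sandbox():
    """Start a disposable database container for local experimentation."""
    subprocess.run([
        "docker", "run", "-d",
        "--name", SANDBOX_NAME,
        "-e", "POSTGRES_PASSWORD=sandbox",
        "-p", "5433:5432",
        ENGINE_IMAGE,
    ], check=True)
    # Wait until the engine accepts connections before loading anything.
    while subprocess.run(["pg_isready", "-h", "localhost", "-p", "5433"]).returncode != 0:
        time.sleep(1)

def load_structure_and_data():
    """Apply the object structure and a minimal test dataset."""
    for script in ("schema.sql", "seed_data.sql"):   # placeholder file names
        subprocess.run(
            ["psql", "-h", "localhost", "-p", "5433", "-U", "postgres", "-f", script],
            check=True, env=PG_ENV,
        )

if __name__ == "__main__":
    create_sandbox()
    load_structure_and_data()
```

The exact tooling matters far less than the properties: the same engine version as the pipeline, an accurate object structure including stored logic, and only as much data as the tests require.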

Key capabilities for making sandbox databases available

The concept sounds simple, but reliably making those things available to a change creator, such as a developer, can be a bit more challenging. Some of those problems are beyond the scope of this series. For example, the license economics of some databases can make provisioning sandboxes challenging. The good news is that many of the other problems become much easier once you have the environment management portion of the CI/CD pipeline sorted.

Beyond the simple conceptual aspects of a sandbox database, there are three key capabilities that must be present for the sandbox to be truly effective:

  • The change creator can refresh the sandbox on demand
  • Each refresh brings the sandbox to a known point
  • The whole provisioning and refresh process is fast

On-demand refresh

The first capability a sandbox must have is self-service refresh: the individual change creator must be able to refresh it themselves whenever they need to. Consider that they are deliberately experimenting to figure out what any given change should look like. Experiments, by their nature, fail often, and the failures are, by definition, unpredictable. This means the need for a refresh will arrive at an unpredictable time. Because a change creator’s productivity is tied to their ability to experiment, they cannot simply wait for a daily, weekly, monthly, or other long-interval refresh. It must happen as needed.
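As a sketch of what “on demand” can mean in practice, continuing the containerized example above, the refresh can be a single command the change creator runs whenever an experiment goes wrong, with no ticket and no schedule involved. The container name and image tag are again placeholders.

```python
# A self-service reset: destroy the disposable sandbox and start over.
import subprocess

SANDBOX_NAME = "sandbox-db"
ENGINE_IMAGE = "postgres:13.3"

def refresh_sandbox():
    """Throw the current sandbox away and start over from the baseline."""
    # The sandbox is disposable by design, so the fastest reset is to
    # remove the container outright...
    subprocess.run(["docker", "rm", "-f", SANDBOX_NAME], check=False)
    # ...and recreate it from the same engine image the pipeline uses.
    subprocess.run([
        "docker", "run", "-d", "--name", SANDBOX_NAME,
        "-e", "POSTGRES_PASSWORD=sandbox", "-p", "5433:5432",
        ENGINE_IMAGE,
    ], check=True)
    # Reloading the schema and test data would follow, exactly as in the
    # provisioning sketch above.
```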

At a known point

The next requirement is that the sandbox must be current to a known point every time it is refreshed. On the surface, that is straightforward: you can implement a refresh from a snapshot or a similar construct. The nuance comes from the fact that snapshots age and cannot always be refreshed immediately. That means there will always be some changes that precede the one the creator is working on but are not yet present in the snapshot. So, your system for refreshing sandboxes starts to look a lot like your system for refreshing the CI or test environments. The main difference is that, because of the experimental nature of the sandbox environment, you may need to give change creators some control over which of those pending changes are applied in a refresh.
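One way to picture this, again as an illustrative sketch rather than a prescription, is a refresh that restores the snapshot baseline and then applies the pending change scripts that postdate it, with an optional filter so the change creator can choose which of them to include. The snapshot file, the changes/pending directory, and the script names are assumptions for the example.

```python
# Sketch: refresh to a known point = snapshot baseline + selected pending changes.
import os
import subprocess
from pathlib import Path

PSQL = ["psql", "-h", "localhost", "-p", "5433", "-U", "postgres"]
PG_ENV = {**os.environ, "PGPASSWORD": "sandbox"}

def refresh_to_known_point(snapshot, pending_dir, include=None):
    """Restore the snapshot baseline, then apply selected pending change scripts."""
    # 1. Restore the snapshot the sandbox was provisioned from.
    subprocess.run(PSQL + ["-f", snapshot], check=True, env=PG_ENV)

    # 2. Apply the change scripts created since the snapshot was taken, in order.
    #    `include` gives the change creator control over which of the
    #    not-yet-snapshotted changes go into this refresh.
    for script in sorted(Path(pending_dir).glob("*.sql")):
        if include is not None and script.name not in include:
            continue
        subprocess.run(PSQL + ["-f", str(script)], check=True, env=PG_ENV)

# For example, take the baseline plus only one of the pending changes:
# refresh_to_known_point("snapshot_2020_08_01.sql", "changes/pending",
#                        include={"0042_add_orders_index.sql"})
```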

Speed

Finally, as with all things, speed is a factor. The more quickly a sandbox can be set up, experimented in, and reset as needed, the better. This means less time for a change creator to wait to verify their work. This has a direct positive impact on the number of experiments they can perform, their ability to iterate toward a solution, their general productivity, and likely their overall morale.

Keeping sandboxes in perspective

The most important thing to remember is that no sandbox database can be perfectly representative of the ‘real’ database. A perfect sandbox would be a full copy of PRODUCTION with all the latest ‘under test’ changes from the pipeline associated with the codestream the developer is working on. That is, of course, impractical for so many reasons. Just a few examples:

  • Size
  • Data privacy requirements
  • Network / load balancing constructs
  • Sheer capacity costs

As such, everyone must expect that a sandbox is never authoritative; it is a local productivity tool to help the change creator make good decisions about the changes they are creating. The integration environment at the CI point must always take precedence in determining what is ‘correct’ because it is the pipeline’s first representative environment and its front-line defender. If that sounds like a bit of a hard line, that is because the pipeline includes PRODUCTION.

This is not really so different from the principle that a developer saying ‘it works on my machine’ does not matter if it does not work in the build or QA environments. If a change works in one environment and not in another, there is a serious environment management problem that must be fixed, lest it drag down the whole team’s productivity.

Summing up

Well-managed sandbox databases are a productivity boon for a development team. A key part of getting the most from them is ensuring that you have matching capabilities for the database layer of your system. In the next post, we’ll explore guardrails for CI/CD.
