CI/CD with Databricks and Azure DevOps

So you’ve created notebooks in your Databricks workspace, collaborated with your peers and now you’re ready to operationalize your work. This is a simple process if you only need to copy to another folder within the same workspace. But what if you need to separate your DEV and PROD environments?

Things get a little more complicated when it comes to automating code deployment; hence CI/CD.

What is CI/CD?

A CI/CD pipeline.

Continuous integration (CI) and continuous delivery (CD) embody a culture, set of operating principles, and collection of practices that enable application development teams to deliver code changes more frequently and reliably. The implementation is also known as the CI/CD pipeline and is one of the best practices for DevOps teams to implement.

Implementing CI/CD in Azure Databricks

Example of continuous integration and delivery

The flow is simple:

  1. A developer develops
  2. The developer checks their code into source control
  3. The developer branch is then merged into the master branch
  4. The code is then deployed to the various environments: DEV, TEST, PROD

For the purpose of this blog, I will demonstrate how a notebook committed to Azure DevOps can be automatically pushed to another branch (folder) of the same workspace.

Note, the target workspace can be another Databricks environment, but to simplify the explanation, I used the same one.

Prerequisites

Make sure you have the following:

A Databricks workspace. You can follow these instructions if you need to create one.
An Azure DevOps project / repo. See here on how to create a new Azure DevOps project and repository.
A sample notebook we can use for our CI/CD example. This tutorial will guide you through creating a sample notebook if you need one.

Binding your DevOps Project

Next, you will need to configure your Azure Databricks workspace to use Azure DevOps which is explained here.

Syncing your notebooks to a Git repo

When you open your notebook, you will need to click on Revision history on the top right of the screen. By default, the notebook will not be linked to a git repo and this is normal.

Notebook not linked to a repo

If you click on “Not linked”, you will be asked a few things:

Status: You will need to set this to “Link”.
Link: Replace the values encapsulated by <> with the appropriate information.
Branch: Leave this as master for now, but it should follow your company’s DevOps best practices.
Path in Git Repo: The folder in Git where the notebook will be created.

Note, if you use the classic URL for Azure DevOps, the organization name is part of the URL as per this example: https://<myOrg>.visualstudio.com. If you use the new format, you will find it as per: https://dev.azure.com/<myOrg>.

If done properly, you will see the “Git: Synced” status.

Notebook linked to a Git repo

Once synced, you will need to save your changes to Git.

Building your CI / CD Pipeline in Azure DevOps

Now that you committed your notebook to Azure DevOps, it’s time to build your CI/CD pipeline.

What is Azure Pipelines? It’s a fully featured continuous integration (CI) and continuous delivery (CD) service. It allows you to build, test and deploy your code to any platform. Add more parallel jobs for even faster pipelines. Build and deploy on Microsoft-hosted Linux, macOS and Windows.

We’ll be concentrating on build and release, leaving test for another blog 🙂

Creating a new build pipeline

A continuous integration trigger on a build pipeline indicates that the system should automatically queue a new build whenever a code change is committed.  This is the CI portion of our process.

Selecting your project, you will be brought into the project summary, where you will see various options on the toolbar on the right. The one that is important for us now is Pipelines. Clicking on “Pipelines” will reveal a submenu, starting with Builds.
You will then need to click on “New” and then “New build pipeline”.

Creating a new build pipeline

Doing so will bring you to the creation screen.

Create new build pipeline
Source: Unless you want to / need to link to another source, keep the default of “Azure Repos Git”.
Team Project: Pick the project you created.
Repository: Pick the repository you created.
Default branch: Keep “master” as the default branch.

Pressing “Continue” will bring you to the template selection screen. The pipeline we’re building involves pulling the changes from the master branch and building a drop artifact which will then be used to push to Azure Databricks in the CD part. Knowing this, pick “Empty job”.

Name your build pipeline ADBDemo-CI and pick Hosted VS2017 pool.

You can pick other types of pools, for example a Linux pool, which tends to be faster, or even your own private pool if you have one. See here for more information regarding pools.

Next, you will click on the “+” sign next to “Agent job 1” which will bring up the list of tasks available to you. Search and add “Publish Build Artifacts”.

Once added, click on the added task and your screen should now look like this:

Adding a new task.

A few things of importance to note on this screen; the Path to publish and the Artifact name.

The task selected will pull artifacts from your Git repository and create a package which will be used by our release pipeline. The Path to publish indicates which folder in your git repository you would like to include in your build. Clicking on the ellipsis will let you browse your repository and pick a folder.

As you browse, you will be able to select an individual file. Even though doing so will let the build succeed, the release task we’ll use to push back to Databricks only supports folders; hence the need to pick a folder and not a file.

The Artifact name identifies the name of the package you will use in the release pipeline. You can keep it as “drop” for simplicity.

Queuing your build

Once you’ve defined your build pipeline, it’s time to queue it so that it can be built. This is done by selecting the “Save & queue” or the “Queue” options. Doing so will ask you to save and commit your changes to the build pipeline.

Note, the text you enter when committing your changes will be used to identify the builds.

View build progress

After submitting your build to the queue, you can monitor the progress by clicking the # item on the top left, as per the following.

Once completed, you should see green check marks on all the steps.

Completed build.

Viewing the content of the build

You can inspect the content of your build by clicking the blue “Artifact” button on the top right. Doing so will allow you to browse the content like so:

Artifact explorer

Building the release pipeline

Now that we have a build created, let’s set up the delivery portion of the CI/CD. To do this, you will go back to the “Pipelines” menu and select “Release” and then “New release pipeline”.

Creating a new release pipeline.

Like before, you will select an empty template.

Once that is done, you will need to configure two sections: artifacts and stages. In the Microsoft documentation, an artifact is described as a deployable component of your application, typically produced through a continuous integration or build pipeline. A stage is described as a logical and independent entity that represents where you want to deploy a release generated from a release pipeline.

Example of a release pipeline.

Again, for simplicity, we’ll create one stage; let’s call it DEV. If done right, your screen should look like this:

Setup up a new stage.

Configuring Artifact

Before configuring the stage, we need to specify the artifacts that will be used for this pipeline. This can be done by clicking “+ Add” in the artifact block and specifying the following:

Add new artifact.
Source type: The type will be Build.
Project: Select the project you created earlier.
Source: Pick the name of the artifact built in the build pipeline. It should be ADBDemo-CI.
Default version: Pick “Latest from the build pipeline default branch with tags”.
Tags: Leave blank.
Source alias: A unique name to identify the artifact in the stage portion of this pipeline. The default of _ADBDemo-CI should do.

Note, if you don’t select Latest, it will prompt you for your build version every time you run your pipeline. It was getting annoying after a while 🙂

Click “Add”

Configuring Stage

Next, we’ll need to add a task to your DEV stage. This can be done by clicking on the “1 job, 0 task” link in the DEV box and then the “+” sign next to “Agent job”.

In the search box of the add task screen, search for Databricks and you should see a task available in the marketplace called “Databricks Script Deployment Task by Data Thirst”. This tool will give you the option of deploying scripts, secrets and notebooks to Databricks. You can see here for more details on the tool.

Go ahead and click install.

Once done, you should see new tasks available to you. Select “Databricks Deploy Notebook” and click “Add”.

Adding the Databricks task.

Now we need to configure the newly added task as per:

Configure Databricks Deploy Notebook task.
Display name: Leave the default name.
Databricks bearer token: You will need to generate a new user token and paste it here. See this article on how to generate a user token.
Azure Region: You can grab that off your Databricks workspace URL. For example, mine is https://canadaeast.azuredatabricks.net/.
Source files path: Click on the ellipsis, browse your linked artifact and pick the folder you want pushed back to Databricks. I picked the “drop” folder.
Target files path: Specify the folder in your Databricks workspace you want the notebook imported to.

Keeping your token in clear text is not best practice. I would suggest using Azure Key Vault to store and retrieve your token.
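
As an aside, if you ever want to script this deployment step yourself instead of using the marketplace task, the Databricks Workspace API can import notebooks directly. Below is a minimal Python sketch of that idea; the workspace URL matches my region, but the token environment variable, local file and target path are illustrative assumptions, not values taken from the task above.

```python
# Minimal sketch: push one notebook into a Databricks workspace through the
# Workspace API -- roughly what the deploy task does for each file it finds.
import base64
import os

import requests

# Assumptions: the token comes from an environment variable, and the local
# file / target path below are examples, not values from this tutorial.
host = "https://canadaeast.azuredatabricks.net"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

local_notebook = "drop/demo-notebook.py"   # a file out of the build artifact
target_path = "/Shared/demo-notebook"      # where it should land in the workspace

# Make sure the target folder exists.
requests.post(
    f"{host}/api/2.0/workspace/mkdirs",
    headers=headers,
    json={"path": "/Shared"},
).raise_for_status()

# Import the notebook source, overwriting any previous version.
with open(local_notebook, "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

requests.post(
    f"{host}/api/2.0/workspace/import",
    headers=headers,
    json={
        "path": target_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
).raise_for_status()
```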

The final step is to create a release by clicking the “+ Release” drop down and select “Create a new release”. Click “Create” on the next screen.

Creating the release

Creating a new release.
See progress.

Like before, click the “Release-#” link on the top left to see the progress.

If you verify the logs and your release ran successfully, you should see green check marks all across the board. I strongly suggest you go through each step and look at the outputs, as it helps you understand what’s going on under the hood.

Successful execution.

Not sure if you’ve noticed, but I have an extra step in my task which is fetching a secret out of Azure Key Vault. As mentioned above, I do this in order not to have my Databricks token pasted in the clear. Just a tip to make things more secure 🙂
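
My extra step uses Azure Key Vault to fetch the secret at release time. If you wanted to do the same lookup from a script instead, a minimal sketch with the Azure SDK for Python could look like the following; the vault URL and secret name are placeholders, not the ones I used.

```python
# Minimal sketch: read the Databricks bearer token from Azure Key Vault
# so it never sits in clear text in the pipeline definition.
# Requires: pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault and secret names -- replace with your own.
vault_url = "https://my-keyvault.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

databricks_token = client.get_secret("databricks-bearer-token").value
```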

All is done!

If all ran well, you should now see the notebook inside the “/Shared” folder of your Databricks workspace.

19 thoughts on “CI/CD with Databricks and Azure DevOps”

  1. Thanks for this very interesting post and the good news about Azure DevOps support in Databricks.
    Maybe a missing point to complete the deployment: how do you link the deployed notebook to the cluster?

    1. This is actually not tied to any cluster. The CI/CD pipeline only moves your code (notebook) from one environment to another. Attaching and running the notebook can be accomplished as part of the release pipeline, but you will need to use a Batch script in your task and then install and use the Databricks CLI. Then you will need to create and run a job.

    1. Technically speaking, you could use Azure DevOps from any cloud but I’m not sure if that feature is available on AWS.

  2. Hi, how would you implement the testing task in this CI process? Pytest cannot be run from a notebook, so we have used unittest and xmlrunner, but how do you configure it for Databricks notebooks?

    1. Haven’t figured out testing yet. I’m able to run a notebook as part of the test task, but it runs as a job, which is asynchronous. I would need to loop while polling the Jobs API until I get a completion. Open to ideas 🙂 (a rough sketch of this submit-and-poll approach appears after these comments)

      1. Hi Benji, I have this working now with unittest and xmlrunner, which outputs the results of the test suite in JUnit XML to a mount point. Submit the notebook as a job via PowerShell and poll the job run until it completes. Then, if successful, copy the file locally (or, in a pipeline, to the build agent) and then use a Publish Test Results task. The way we’ve got our test suite up and running using unittest, it reports the error in the notebook for the metadata-driven test (via ddt). It will be my first blog post, as nobody has done it yet.

  3. Hi Benjamin,
    The article is awesome. I was banging my head trying to solve CI/CD for Databricks before finding your post. I’m planning to implement the same for multiple environments in a single build. For example, I have three environments: Dev, Test, Prod. If I change the code in the Dev environment, the same change should flow to the notebooks in the other environments. Could you please help me with your input?

    1. That’s done in your DevOps release. Simply add more stages, one for every environment. Each stage pushes to a different Databricks workspace.

      1. Thanks Benjamin,
        I don’t understand why we are copying the Databricks code to the Shared folder. What is the use of it?

  4. Hello Benjamin,

    Thanks a lot for this great article !
    I am trying to implement CI/CD in a project using Databricks and Data Factory.
    The users will work on the Databricks notebooks in Dev. Once they are happy with the changes, they save them, which triggers a build that moves everything to UAT for now.
    My questions would be :
    * At which step should I test the notebooks? Build or Release?
    * How should I proceed to sync with Data Factory’s own CI/CD?

    Thanks for your lights !
    Cheers,

    1. Great question. I would test the notebooks as part of the release steps. As for ADF, as it already integrates with Git, you should be able to use Azure DevOps’s Git repo and use its CD process to auto-generate builds. As for releases, you could use the ADF REST API to push to other environments.
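
Following up on the job-run questions above, here is a rough Python sketch of the submit-and-poll idea mentioned in the replies: submit the deployed notebook as a one-time job run through the Jobs API, then poll until the run finishes. The workspace URL, cluster ID and notebook path are placeholders, not values from this post.

```python
# Rough sketch of the approach discussed in the comments: submit a notebook
# as a one-time job run, then poll the Jobs API until it completes.
import os
import time

import requests

# Placeholders: the workspace URL, cluster ID and notebook path are examples.
host = "https://canadaeast.azuredatabricks.net"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

run = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers=headers,
    json={
        "run_name": "ci-test-run",
        "existing_cluster_id": "0123-456789-abcdefgh",
        "notebook_task": {"notebook_path": "/Shared/demo-notebook"},
    },
)
run.raise_for_status()
run_id = run.json()["run_id"]

# Poll until the run reaches a terminal life cycle state.
while True:
    state = requests.get(
        f"{host}/api/2.0/jobs/runs/get",
        headers=headers,
        params={"run_id": run_id},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

print("Result:", state.get("result_state"))  # SUCCESS means the notebook ran cleanly
```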
