Have you ever thought about how you can protect sensitive data in your application? One way to do this is by using data masking, which is the process of obscuring sensitive data to prevent unauthorized access or disclosure. In this blog post, we'll talk about how to integrate data masking into a CI/CD pipeline using Azure DevOps as an example.


There are several reasons why it is important to integrate data masking into a CI/CD pipeline:


1. Protect sensitive data: Data masking helps to protect sensitive data, such as personal information (PII, PHI) or confidential business information, from unauthorized access or disclosure. You can ensure that sensitive data is properly masked in all environments, including development, staging, and production.


2. Comply with regulations: Depending on your industry and location, you may be required to comply with regulations that mandate the protection of sensitive data, such as GDPR in the EU, HIPAA in the US, or PIPEDA in Canada (for a broader list, this article on countries with GDPR-like privacy laws is worth a read: https://securityscorecard.com/blog/countries-with-gdpr-like-data-privacy-laws). By integrating data masking into the CI/CD pipeline, you can ensure that you are meeting these requirements in all environments.


3. Remove unintentional errors: Automating the process eliminates the mistakes that are bound to creep into manual masking. Errors introduced by human intervention come with real costs.


4. Remove unnecessary delays: Running masking inside the CI/CD pipeline guarantees that data is available on time, avoiding delays both in running tests and in supplying that data to the QA team.


Step 1: Identify the data that needs to be masked


The first step when integrating a data masking process into your CI/CD pipeline is figuring out what data needs to be masked. This usually includes personal information like names, addresses, and social security numbers, as well as any sensitive business info that shouldn't be shared with unauthorized parties.


Once you know what data needs to be masked, you need to find out where it's stored within your application. This could be in a database, a configuration file, or hardcoded into the application itself.
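If the sensitive data lives in a relational database, the database's own metadata is a quick way to take inventory. Here is a minimal sketch for MySQL, assuming a database named employees_test and purely illustrative column-name patterns; adjust both to your own schema:

# List columns whose names suggest PII, using MySQL's information_schema
mysql -uroot -proot -e "
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE table_schema = 'employees_test'
    AND (column_name LIKE '%name%'
         OR column_name LIKE '%email%'
         OR column_name LIKE '%ssn%'
         OR column_name LIKE '%address%');"

This will not catch everything (sensitive values can hide in free-text columns), but it gives you a starting list to review with the data owners.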


Step 2: Choose a data masking tool


There are a bunch of data masking tools out there, both open-source and paid. Some examples are Anonymizer, Mask My Data, and Dataguise. When picking a data masking tool, consider factors like compatibility with your CI/CD pipeline and application, cost, and what types of data it can mask.


Step 3: Set up the data masking tool


Once you have chosen a data masking tool, integrate it with your CI/CD tool, typically through the masking tool's API or command-line interface, and specify which data should be masked during the pipeline execution.


In this case, we will use Gigantics to easily create an anonymized dataset that will be used to run tests within a CI/CD pipeline.


Using the Gigantics interface, we will create a unique URL with an API key pointing to a dataset containing realistic data. Simply copy the URL and invoke it from your CI/CD pipeline configuration file to get the dataset (an .sql file).
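Before wiring this into the pipeline, you can sanity-check the URL from a terminal. A minimal sketch, where the placeholders match the ones used in the pipeline below and the output file name is just an example:

# Download the anonymized dataset and take a quick look at it
curl -k https://<GIGANTICS_URL>/dataset/<APIKEY> -o dataset.sql
head -n 20 dataset.sql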


Step 4: Integrate into your CI/CD pipeline


You'll need to add a step that runs the data masking tool as part of the pipeline. In Azure DevOps, this can typically be done by adding a shell script step that calls the data masking tool with the appropriate arguments.


For example, here is an Azure DevOps pipeline that loads an anonymized dataset into a MySQL database and executes some tests.


trigger:
  - master

pool:
  vmImage: ubuntu-latest

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: '14.x'
    displayName: 'Install Node.js'

  - script: |
      # Start the MySQL service that ships with the ubuntu-latest image
      sudo /etc/init.d/mysql start
      mysql -e 'CREATE DATABASE employees_test;' -uroot -proot
      mysql -e 'SHOW DATABASES;' -uroot -proot
    displayName: Start mysql database

  - script: |
      # Stream the anonymized dataset from Gigantics into the test database
      curl -k https://<GIGANTICS_URL>/dataset/<APIKEY> | mysql -uroot -proot employees_test
    displayName: Load test dataset from Gigantics

  - script: |
      npm install
      npm run build
      npm test
    displayName: 'npm install & unit tests'
    workingDirectory: 'client/'

  - script: |
      (npm start &)
      ./node_modules/.bin/cypress run
    displayName: 'run cypress tests'
    workingDirectory: 'client/'

In this example, the pipeline runs every time a developer pushes to the master branch on GitHub. When this happens, a fresh Ubuntu build agent is provisioned, Node.js is installed, and the MySQL server that ships with the ubuntu-latest image is started. We create an empty database into which the test data from Gigantics is loaded. Once the anonymized data is in place, the tests are run; in this case, we use the Cypress framework to perform them.


Subsequently, the pipeline would continue according to the requirements of each client. This pipeline is written for Azure DevOps, but it can be adapted to any CI/CD tool, since the data from Gigantics is fetched with nothing more than a curl command.
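To illustrate that portability, here is a rough sketch of the same database setup and loading steps as a GitHub Actions workflow (the workflow name and structure are assumptions for illustration; GitHub-hosted Ubuntu runners also ship with MySQL preinstalled, so the same commands apply):

name: tests-with-masked-data
on:
  push:
    branches: [master]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start MySQL and create the test database
        run: |
          sudo /etc/init.d/mysql start
          mysql -e 'CREATE DATABASE employees_test;' -uroot -proot
      - name: Load test dataset from Gigantics
        run: curl -k https://<GIGANTICS_URL>/dataset/<APIKEY> | mysql -uroot -proot employees_test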


Step 5: Test the data masking process


After integrating the data masking tool into your CI/CD pipeline, it is important to test the process to ensure that it is functioning properly. You can do this by running a test build of your CI/CD pipeline and checking the output of the data masking step to ensure that the data is being masked correctly.
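One way to automate this check is to add a step that fails the build when recognizable real values survive masking. Here is a sketch for the pipeline above, assuming an employees table with an email column and that real addresses share a known domain (both are illustrative assumptions):

  - script: |
      # Fail the build if unmasked emails from the real domain are still present
      COUNT=$(mysql -uroot -proot -N -e \
        "SELECT COUNT(*) FROM employees_test.employees WHERE email LIKE '%@realcompany.com';")
      if [ "$COUNT" -ne 0 ]; then
        echo "Found $COUNT unmasked email addresses"
        exit 1
      fi
    displayName: Verify dataset is masked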


You should also test the application to ensure that it is functioning properly with the masked data. This may involve running tests on the masked data or manually verifying that the application is behaving as expected with the masked data.


By thoroughly testing the data masking process, you can ensure that it is working properly and that sensitive data is being properly protected.


Step 6: Monitor and maintain the data masking process


It is important to regularly monitor the data masking process to ensure that it is functioning properly and masking the correct data. This may involve checking the output of the data masking tool to ensure that the masked data is correct, as well as monitoring the application to ensure that it is functioning properly with the masked data.
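In Azure DevOps, one simple way to monitor the process is to run the same pipeline on a schedule, so that a broken dataset URL or a revoked API key is caught even when nobody is pushing code. A minimal sketch, where the cron expression is just an example:

schedules:
  - cron: '0 6 * * *'           # every day at 06:00 UTC
    displayName: Nightly masked-data check
    branches:
      include:
        - master
    always: true                # run even if there are no new commits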


By integrating a data masking process into your CI/CD pipeline, you can ensure that sensitive data is properly masked in all environments, helping to protect it from unauthorized access or disclosure.


In conclusion, integrating data masking into a CI/CD pipeline is a good way to ensure that sensitive data is properly protected in all environments. By following the steps outlined in this blog post, you can effectively integrate a data masking tool into your pipeline.