Airflow + SQL Server = 💘

Dockerfile for Airflow + SQL Server (mssql)

My latest project involved productionizing a Kubernetes deployment of Airflow using Bitnami's Helm charts. Bitnami has done a great job creating a generalized Airflow deployment. However, its image cannot connect to SQL Server (including Synapse), which was a major requirement for my client.

I have to admit that I had never heard of Airflow before this engagement. Airflow was originally created by Airbnb to design, schedule, and monitor ETL jobs. It's become an important tool in the data scientist's tool belt. Anyway, I had to quickly get up to speed on the topology and figure out how to deploy it into a production environment!

Initially it appeared that SQL Server (mssql) support would be available out of the box. Unfortunately, although the apache-airflow-providers-microsoft-mssql provider is included in the Bitnami image, it doesn't actually work. Bad timing, it would seem, as mssql support in Airflow is in a state of migration: the apache-airflow-providers-microsoft-mssql provider has been deprecated in favor of the apache-airflow-providers-odbc provider. Easy enough, I thought. A quick pip install apache-airflow-providers-odbc to pull in the new provider and... no luck.

Installing collected packages: pyodbc, azure-storage-blob, azure-mgmt-datafactory, apache-airflow-providers-odbc, apache-airflow-providers-microsoft-azure
    Running install for pyodbc: started
    Running install for pyodbc: finished with status 'error'
    ERROR: Command errored out with exit status 1:
    unable to execute 'gcc': No such file or directory
    error: command 'gcc' failed with exit status 1
pip fails because no compiler is available to build pyodbc

Soon, I found myself down the rabbit hole of missing binaries. It quickly became apparent that I would need to customize the image. The first step, of course, was to add a new task to the sprint to appease the agile overlords (kidding, I love agile).

After a few hours of work, I was able to produce an Airflow Dockerfile that extends Bitnami's image to include the binaries and libraries necessary to connect to SQL Server.

Since there are three separate Airflow images (web, worker, and scheduler), I parameterized the Dockerfile. Passing the base image name as a build-arg makes it easy to create each extended image.

docker build --build-arg IMAGE=airflow -t mcgough/airflow:latest .
docker build --build-arg IMAGE=airflow-worker -t mcgough/airflow-worker:latest .
docker build --build-arg IMAGE=airflow-scheduler -t mcgough/airflow-scheduler:latest .
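
If you rebuild these images often, the three commands above can be scripted. The following is a minimal Python sketch and not part of the original setup: the helper names build_command and build_all are hypothetical, and the mcgough repository and latest tag simply mirror the commands above. It defaults to a dry run that only prints the commands.

```python
import subprocess

# The three Bitnami base image names used in the commands above.
IMAGES = ["airflow", "airflow-worker", "airflow-scheduler"]


def build_command(image, repo="mcgough", tag="latest", context="."):
    """Return the docker build argument list for one extended image."""
    return [
        "docker", "build",
        "--build-arg", f"IMAGE={image}",
        "-t", f"{repo}/{image}:{tag}",
        context,
    ]


def build_all(dry_run=True):
    """Build every image in IMAGES; with dry_run=True only print the commands."""
    for image in IMAGES:
        cmd = build_command(image)
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on the first failed build instead of continuing.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    build_all(dry_run=True)
```

Calling build_all(dry_run=False) runs the builds for real and stops at the first one that fails.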

Finally, I couldn't find a working sample DAG for mssql, so I created this quick example DAG to test connectivity.
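
Before wiring up a DAG, it can also help to confirm that the ODBC driver itself works from inside the extended image. The sketch below is a standalone check, not the DAG mentioned above. It assumes pyodbc is installed and that the driver is registered under the name "ODBC Driver 17 for SQL Server" (adjust this to whatever your image actually installs). The function names and the server, database, and credential values are placeholders.

```python
def build_connection_string(server, database, user, password,
                            driver="ODBC Driver 17 for SQL Server", port=1433):
    """Assemble an ODBC connection string for SQL Server.

    The driver name is an assumption; values containing ';' or '}' would
    need extra quoting, which this sketch does not handle.
    """
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server},{port};"
        f"DATABASE={database};"
        f"UID={user};"
        f"PWD={password}"
    )


def check_connection(conn_str, timeout=5):
    """Run SELECT 1 and return True if SQL Server answers."""
    import pyodbc  # imported lazily so the helper above works without it

    conn = pyodbc.connect(conn_str, timeout=timeout)
    try:
        row = conn.cursor().execute("SELECT 1").fetchone()
        return row[0] == 1
    finally:
        conn.close()
```

Running check_connection(build_connection_string(...)) with real values from a Python shell in the worker container should return True once the driver and the network path are in place. An exception from pyodbc.connect usually points at a missing driver or an unreachable server.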