Python, Pandas, SQLAlchemy, SQL Server and Docker
In this article I am going to walk through a simple application that makes use of Pandas, SQLAlchemy and SQL Server on Docker. This combination comes from a project I have been working on recently: a cash flow analytics process that moves financial data from one stage to another, massaging, transforming and merging it with various sources before calculating the final figures that are stored in the SQL database.
Here, I am going to explain the steps needed to create the environment and put together a small app that demonstrates how these technologies can be used together. The app will load data from a CSV file into a Pandas DataFrame and then save it into a SQL Server database using pyodbc and SQLAlchemy.
Set up
Before we start, let's create a development environment by setting up a SQL Server 2017 database in a Docker container, where the final data will be persisted and can be consumed by other applications such as reporting tools. Then we need to set up the ODBC drivers for Linux so we can connect to the database from our app.
SQL Server on Docker
Download the Docker image
docker pull microsoft/mssql-server-linux
Start a new container and specify the DB sysadmin (SA) password
docker run -itd -e 'ACCEPT_EULA=Y' -e 'MSSQL_SA_PASSWORD=ChangePassw0rd!' \
    -p 1433:1433 --name sql \
    microsoft/mssql-server-linux:2017-latest

# Change password to a new one
sudo docker exec -it sql /opt/mssql-tools/bin/sqlcmd \
    -S localhost -U SA -P "ChangePassw0rd!" \
    -Q 'ALTER LOGIN SA WITH PASSWORD="NewPassw0rd!"'
Check the container is running
docker ps
Create a new DB
Attach to the container and connect to the DB
khalid@ubuntu:~/docker/sqlserver$ docker exec -it sql "bash"
root@9fe0f2c60294:/# /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P "NewPassw0rd!"
Once you are in, create a test DB and a table:
1> create database findb;
2> go
1> use findb;
2> go
Changed database context to 'findb'.
1> create table daily_rates (RateDate date, Country varchar(256), Value decimal(20,8));
2> go
1> select * from daily_rates;
2> go
RateDate         Country                     Value
---------------- --------------------------- ----------------------

(0 rows affected)
We will load the data later on using Pandas and SQLAlchemy.
Install SQL ODBC drivers for Linux
The following steps are copied from Microsoft's official documentation for installing the ODBC driver on Linux; choose the instructions relevant to your distribution. I use Ubuntu Bionic Beaver (18.04), so the 18.04 block below applies in my case, but the same documentation covers other distributions as well.
sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -

#Download appropriate package for the OS version
#Choose only ONE of the following, corresponding to your OS version

#Ubuntu 16.04
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list

#Ubuntu 18.04
curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list

#Ubuntu 19.10
curl https://packages.microsoft.com/config/ubuntu/19.10/prod.list > /etc/apt/sources.list.d/mssql-release.list

exit
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install msodbcsql17

# optional: for bcp and sqlcmd
sudo ACCEPT_EULA=Y apt-get install mssql-tools
echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile
echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
source ~/.bashrc

# optional: for unixODBC development headers
sudo apt-get install unixodbc-dev
Create ODBC settings file odbc.ini
Get the SQL Server container's IP address by inspecting the Docker bridge network (assuming you followed my instructions above and started it on the default bridge network):
khalid@ubuntu:~$ docker network inspect bridge
...
        "ConfigOnly": false,
        "Containers": {
            "9fe0f2c60294271e19fc3d27f59d56e3ecaf40afd5f5698452b3335a5a573678": {
                "Name": "sql1",
                "EndpointID": "e8afe3352d4a02b38340413ba559a8b8e388eea6412316b75f626bd21632791a",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
First, create a temporary file like so:
khalid@ubuntu:~$ echo "[MSSQLServerDatabase]
Driver = ODBC Driver 17 for SQL Server
Description = Connect to my SQL Server instance
Trace = No
Server = 172.17.0.2" > odbc.tmp
Install ODBC DSN settings
khalid@ubuntu:~$ sudo odbcinst -i -s -f ~/odbc.tmp -l
Check that the installation was successful. You should see the same lines as above in the config file below:
khalid@ubuntu:~$ cat /etc/odbc.ini
[MSSQLServerDatabase]
Driver=ODBC Driver 17 for SQL Server
Description=Connect to my SQL Server instance
Trace=No
Server=172.17.0.2
Create a new Python virtual environment
khalid@ubuntu:~/dev/python_code/pd_sql$ python -m venv venv
# and activate
khalid@ubuntu:~/dev/python_code/pd_sql$ source venv/bin/activate
(venv) khalid@ubuntu:~/dev/python_code/pd_sql$
Install Pandas, SQLAlchemy and pyodbc
While you are in the activated virtual environment, install the required packages:
(venv) khalid@ubuntu:~/dev/python_code/pd_sql$ pip install pandas sqlalchemy pyodbc
NOTE: While installing the above you might encounter the following pyodbc compilation error. This is because pyodbc needs to compile native code when installed on POSIX (Linux in this case) systems, and the build fails when it cannot find the required Python development headers (Python.h).
-g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.28 -I/home/khalid/dev/python_code/pd_sql/venv/include -I/usr/include/python3.8 -c src/buffer.cpp -o build/temp.linux-x86_64-3.8/src/buffer.o -Wno-write-strings
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:45:10: fatal error: Python.h: No such file or directory
 #include <Python.h>
          ^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
FIX: I have created a fix and filed a pull request with the owner and am waiting for it to be resolved. Until then, you can pull the fixed package from my repo and install it: https://github.com/khalidmammadov/pyodbc
~/dev/python_code/pd_sql$ git clone https://github.com/khalidmammadov/pyodbc
(venv) khalid@ubuntu:~/dev/python_code/pd_sql/pyodbc$ git checkout compile_for_venv
Branch 'compile_for_venv' set up to track remote branch 'compile_for_venv' from 'origin'.
Switched to a new branch 'compile_for_venv'
(venv) khalid@ubuntu:~/dev/python_code/pd_sql/pyodbc$ cd ../
(venv) khalid@ubuntu:~/dev/python_code/pd_sql$ pip install pyodbc/
Processing ./pyodbc
Installing collected packages: pyodbc
    Running setup.py install for pyodbc ... done
Successfully installed pyodbc-4.0.29b2+compile.for.venv
(venv) khalid@ubuntu:~/dev/python_code/pd_sql$
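At this point it is worth checking that pyodbc can actually reach the database through the DSN we configured earlier. Below is a minimal sketch; the DSN name, SA user and password are the ones from the setup steps above, so adjust them if yours differ.

import pyodbc

# Quick connectivity check against the DSN from /etc/odbc.ini
# (DSN name, user, password and database are the ones configured earlier)
conn = pyodbc.connect("DSN=MSSQLServerDatabase;UID=SA;PWD=NewPassw0rd!;DATABASE=findb")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM daily_rates")
print(cursor.fetchone()[0])  # 0 for now, the table is still empty
conn.close()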
Pandas app
Now we can start building an app that makes use of all the technologies installed above. For testing purposes I am going to use exchange rates data that can be downloaded from https://github.com/khalidmammadov/exchange-rates/blob/master/data/daily.csv
Download or clone the code I have prepared, as shown below:
https://github.com/khalidmammadov/python_code
(venv) khalid@ubuntu:~/dev/$ git clone https://github.com/khalidmammadov/python_code
(venv) khalid@ubuntu:~/dev/$ cd python_code/pd_sql
Below is the main part of the app, from __main__.py. Here the App class is used to hold the state of the application in one place and to enrich it with various shared attributes and methods. It takes various arguments from the command line, the main one being the password to the database.
The CSV data is read into a Pandas DataFrame using a delegated static method on the App class. The data type of the date column is then corrected due to an underlying library bug (which affects the pyodbc package on Linux only). After that the data is ready to be saved into the database.
__main__.py
# Initialise a global app
# this will hold the state and components of the app
app = App(log_flag=logging_enabled,
          debug_flag=debug_enabled,
          debug_dir=debug_dir,
          dbpwd=dbpwd)

# Main implementation
csv_path = Path('~/docker/test/exchange-rates/data')
daily_csv_file = 'daily.csv'

# Read file to a DataFrame
csv_df = app.get_csv_data(csv_path.joinpath(daily_csv_file))
csv_df = csv_df.astype({'Date': 'datetime64'})
csv_df = csv_df.rename(columns={'Date': 'RateDate'})
csv_df = csv_df.fillna(0)

# Check the data
app.debug_df('Rates from CSV:', csv_df)

# Save Rates Into DB
app.save_rates_into_db(csv_df)
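As an aside, if you only needed a one-off load with no update logic, Pandas can push a DataFrame straight into SQL Server via a SQLAlchemy engine. This is not what the app above does (it has to distinguish inserts from updates), but it is a handy sketch to keep in mind; the DSN-based connection URL below is an assumption based on the setup earlier in this article.

import pandas as pd
from sqlalchemy import create_engine

# One-off load sketch: no upsert logic, just append the CSV rows to the table.
# Assumes the MSSQLServerDatabase DSN from /etc/odbc.ini and that its default
# database is findb (e.g. via a "Database = findb" entry in the DSN).
engine = create_engine("mssql+pyodbc://SA:NewPassw0rd!@MSSQLServerDatabase")

df = pd.read_csv("daily.csv")
df = df.rename(columns={"Date": "RateDate"}).fillna(0)
df.to_sql("daily_rates", engine, if_exists="append", index=False)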
app.py
The App class defines various logging and debug tasks first and then the methods for the actual app tasks.
class App(object):
    def __init__(self,
                 log_flag=None,
                 debug_flag=None,
                 debug_dir=None,
                 dbpwd=None):
        self.logger = App.set_up_logger(log_flag, debug_flag)
        self.debug_enabled = debug_flag
        self.debug_dir = None
        self.config = load_config('conf.ini')
        dsn_name = self.get_config('DSN', 'Name')
        self.db_connection = new_connection(dsn_name, dbpwd)

    ....
    ....

    @timing
    def get_config(self, section, name):
        return get_config(self.config, section, name)

    @staticmethod
    def get_csv_data(file):
        return pd.read_csv(file)

    @timing
    def save_rates_into_db(self, rates_df):
        # Get rates from DB
        rates_db_df = get_all_rates(self.db_connection)
        rates_db_df = rates_db_df.astype({'RateDate': 'datetime64'})

        # Left Outer Join so we check for existence
        joined = pd.merge(rates_df,
                          rates_db_df,
                          left_on=['RateDate', 'Country'],
                          right_on=['RateDate', 'Country'],
                          how='left',
                          suffixes=['', '_table'])

        # Separate Inserts and updates
        cond = joined['Value_table'].isnull()
        inserts = joined[cond]
        updates = joined[~cond]

        # Check
        self.debug_df('Inserts:', inserts)
        self.debug_df('Updates:', updates)

        if inserts.index.size > 0:
            insert_rates(self.db_connection, daily_rates_tbl, inserts)

        if updates.index.size > 0:
            update_rates(self.db_connection, daily_rates_tbl, updates)

    def __del__(self):
        if self.db_connection:
            self.db_connection.close()
            del self.db_connection
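The constructor reads the DSN name from conf.ini and passes it, together with the password, to new_connection from the lib submodule. That helper is not listed in this article; a minimal sketch of what it might look like using SQLAlchemy's DSN-style connection URL is below (the SA user name is an assumption carried over from the setup above).

from sqlalchemy import create_engine

def new_connection(dsn_name, dbpwd):
    # Build a SQLAlchemy engine from the ODBC DSN defined in /etc/odbc.ini
    # and return a live connection; the SA user is an assumption here.
    url = "mssql+pyodbc://SA:{pwd}@{dsn}".format(pwd=dbpwd, dsn=dsn_name)
    engine = create_engine(url)
    return engine.connect()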
def save_rates_into_db
This method first reads the existing rates from the DB (so that updates can be detected later in the code) and converts them into a DataFrame. The data from the CSV and from the DB is then joined to decide which rows need to be inserted and which need to be updated.
Then the inserts and updates are separated and are ready to go into the DB, as the small illustration below shows.
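To make the split concrete, here is a tiny standalone illustration of the same left-join-and-isnull pattern on made-up data (the values are purely illustrative):

import pandas as pd

# Rows coming from the CSV
csv_df = pd.DataFrame({'RateDate': ['2020-01-01', '2020-01-01'],
                       'Country': ['Australia', 'Canada'],
                       'Value': [1.61, 1.46]})

# Rows already in the daily_rates table
db_df = pd.DataFrame({'RateDate': ['2020-01-01'],
                      'Country': ['Australia'],
                      'Value': [1.60]})

joined = pd.merge(csv_df, db_df, on=['RateDate', 'Country'],
                  how='left', suffixes=['', '_table'])
cond = joined['Value_table'].isnull()
print(joined[cond])   # Canada: not in the DB yet, so it becomes an insert
print(joined[~cond])  # Australia: already in the DB, so it becomes an update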
The functions below, from the lib submodule, do the actual inserts and updates using SQLAlchemy syntax.
def insert_rates(connection, table, inserts):
    # Make dictionary of parameters for the inserts
    ins_params = make_upsert_list(inserts)
    connection.execute(
        table.insert(),
        ins_params)


def update_rates(connection, table, updates):
    # Make dictionary of parameters for the updates
    upd_params = make_upsert_list(updates)

    # Execute statement (will autocommit)
    # Generate single updates, one per parameter dict
    for uld in upd_params:
        connection.execute(
            table.update().
            where(
                and_(
                    table.c.RateDate == uld['RateDate'],
                    table.c.Country == uld['Country'])),
            uld)
In both cases the data is first converted into a list of dicts so it can be used as parameters for the DML operations.
def make_upsert_list(df):
    keys, data = df_to_sqla(df)
    return [dict(zip(keys, row)) for row in data]
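For example, on a one-row DataFrame the function produces the kind of parameter list SQLAlchemy expects. This is a rough sketch with made-up values; it relies on df_to_sqla and convert_type shown just below.

import numpy as np
import pandas as pd

# Rough illustration of the parameter shape produced by make_upsert_list
sample = pd.DataFrame({'RateDate': pd.to_datetime(['2020-01-01']),
                       'Country': ['Canada'],
                       'Value': np.array([1.46])})
print(make_upsert_list(sample))
# [{'RateDate': datetime.datetime(2020, 1, 1, 0, 0), 'Country': 'Canada', 'Value': 1.46}]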
Also, the conversions: values coming out of a DataFrame are NumPy/Pandas types (np.int64, np.float64, pd.Timestamp), so they are converted to native Python types before being handed to the database driver.
def convert_type(val):
    if isinstance(val, pd.Timestamp):
        val = val.to_pydatetime()
    elif isinstance(val, np.int64):
        val = int(val)
    elif isinstance(val, np.float64):
        val = float(val)
    return val


def df_to_sqla(df):
    keys = df.columns
    ncols = len(keys)
    nrows = len(df.index)
    print(df.dtypes)
    data_list = []
    for r in range(nrows):
        data_list.append([])
        for c in range(ncols):
            val = df.iloc[r, c]
            val = convert_type(val)
            data_list[r].append(val)
    return keys, data_list
Conclusion
This article demonstrates how easily you can spin up a database in a Docker container, connect to it using Python (pyodbc, SQLAlchemy) and manipulate data as required using Pandas.
Thanks for reading, and I hope you liked this article and learned something new today!
References
Below you can find the reference articles that I used at various stages:
https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-SQL-Server-from-Linux