
I am following this advice to write a pandas dataframe into a SQL database using pyodbc. In a gist, this is the SQL code for table creation:

CREATE TABLE [HumanResources].[DepartmentTest](
[DepartmentID] [smallint] NOT NULL,
[Name] [dbo].[Name] NOT NULL,
[GroupName] [dbo].[Name] NOT NULL
)
GO

and this is how you write records from a pandas dataframe into the table.

import pyodbc
import pandas as pd
# insert data from csv file into dataframe.
# working directory for csv file: type "pwd" in Azure Data Studio or Linux
# working directory in Windows c:\users\username
df = pd.read_csv("c:\\users\\username\\department.csv")
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'yourservername' 
database = 'AdventureWorks' 
username = 'username' 
password = 'yourpassword' 
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()

It works, but the code is extremely rigid: the column data types and even the column names are hardcoded. To meet my use case I need a more dynamic scenario, where a table is created in the SQL database based on a well-formed pandas dataframe and then all of its rows are pushed to it. The column names, types and even the table name will only be known at execution time.
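
Something like the rough sketch below is what I imagine I would otherwise have to hand-roll myself; it is untested, the dtype-to-SQL-type mapping is only a guess on my part, and the table/column names go straight into the SQL text without any sanitisation, so I would much rather use something built in.

import pyodbc
import pandas as pd

# Rough pandas dtype -> SQL Server type mapping (a guess, not exhaustive).
SQL_TYPES = {
    "int64": "BIGINT",
    "int32": "INT",
    "float64": "FLOAT",
    "bool": "BIT",
    "datetime64[ns]": "DATETIME2",
    "object": "NVARCHAR(255)",  # fallback for strings
}

def write_frame(cnxn, df, table_name, schema="dbo"):
    # Build the column definitions from the dataframe's dtypes.
    cols = ", ".join(
        f"[{col}] {SQL_TYPES.get(str(dtype), 'NVARCHAR(255)')}"
        for col, dtype in df.dtypes.items()
    )
    col_names = ", ".join(f"[{c}]" for c in df.columns)
    placeholders = ", ".join("?" for _ in df.columns)

    cursor = cnxn.cursor()
    # NOTE: table/column names are interpolated directly here and would
    # need cleaning up before real use.
    cursor.execute(f"CREATE TABLE [{schema}].[{table_name}] ({cols})")
    cursor.fast_executemany = True  # batched insert instead of row-by-row
    # Convert numpy scalars to plain Python values for pyodbc.
    rows = df.astype(object).values.tolist()
    cursor.executemany(
        f"INSERT INTO [{schema}].[{table_name}] ({col_names}) VALUES ({placeholders})",
        rows,
    )
    cnxn.commit()
    cursor.close()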

This seems like a very standard operation, so are there any ready-made pandas/pyodbc functions to cater to this? Basically, I am asking for code to write a pandas dataframe to SQL (maintaining the column data types, i.e. objects become strings, integers become integers, etc.) without assuming the column names.

Della
  • You are inserting into the SQL DB row-by-row, which is a guarantor of poor performance. Have you looked at [`to_sql`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html)? You will also need a process to clean up table names and resolve conflicts if all tables are to be created at run time – Code Different Dec 01 '22 at 12:21
  • Thanks, yeah, those concerns occurred to me, but being very new at SQL I was not able to do it in a clean way; the MS advice (my link) does look rather inefficient. Also, the ```to_sql``` interface apparently does not support pyodbc. That's why the question: how to do it in the cleanest possible way? Even in pandas or spark etc. we never do row-by-row operations; it's unbelievable that I have to do it row-by-row when writing to a SQL database. – Della Dec 01 '22 at 12:25
  • Aside... I'd avoid user-defined data types for character-based columns if you can. I know you can do it but it prevents you from specifying collations and that can complicate your life down the road if you need to support international characters in specific ways. – AlwaysLearning Dec 01 '22 at 12:29
  • That's very bad advice from Microsoft. SQLAlchemy (the one used by `to_sql`) has support for PyODBC: https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc (a rough sketch of that route follows after these comments) – Code Different Dec 01 '22 at 12:48
  • "the to_sql interface apparently does not support pyodbc" - That is not true. See [this answer](https://stackoverflow.com/a/63178240/2144390) for an example. – Gord Thompson Dec 01 '22 at 17:19

0 Answers