I have a script that reads a file as
with open(filepath + filename, 'r') as inFile:
for line in inFile:
...
the above can read (and process) a 7gb file with its different columns in ~3mn.
My problem is in the inserts to the database, they are too slow.
I tried mimicking the way a dataflow in SSIS performs db inserts but I didn't succeed (idk what they use too). I tried with pyodbc (classic execute
and executemany
but they're too slow even with fast_executemany
), I also used bcpandas
(the equivalent of bulk inserts with ~5K rows/s but I have also a lack of performance because I have to convert my data to a dataframe which takes half the time of the overall process or so. I also tried printing the data and running the script from ssis to use the inserts but the python prints is not meant for that and is too slow. I tried creating a new file to read in ssis but it takes a much time as an executemany.
Is there any other way of doing this that I missed? what is the way SSIS uses to perform dataflow inserts? my db: mssql
import pyodbc # tested executemany()
import time
import pandas as pd
import bcpandas
import sqlalchemy
# start time
startTime = time.time()
## file parameters
filepath = "C:/"
filename = "file.txt"
alchemy_eng = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=DRIVER={ODBC Driver 17 for SQL Server};Server=localhost;Database=test;Trusted_Connection=yes;')
bcpandas_eng = bcpandas.SqlCreds.from_engine(alchemy_eng)
### process file
counter = 0
insert_params = []
with open(filepath + filename, 'r') as infile:
for line in inFile:
# temp table partial structure
col1,col2,col3,.. = line.split('|')
if col2 == "valueA":
col2 = None
# couple more ..
col3 = col3.replace(',', '.')
# couple more ..
# commit every million not all to avoid memory crash
if counter == 1000000:
df = pd.DataFrame(insert_params)
df.columns = ['col1',...]
bcpandas.to_sql(df, 'table1', bcpandas_eng, if_exists="append")
# reset
insert_params = []
counter = 0
break
if counter == 1:
cursor.execute(insert_query, insert_params)
break
# list of elements
insert_params.append((col1,col2,col3,..),)
counter += 1
# end time
endTime = time.time()
# elapsed time
print("elapsed time", endTime - startTime)