
I have a script that reads a file like this:

with open(filepath + filename, 'r') as inFile:
    for line in inFile:
        ...

The above can read (and process) a 7 GB file, with its different columns, in ~3 minutes. My problem is with the inserts to the database: they are too slow. I tried mimicking the way an SSIS data flow performs DB inserts, but I didn't succeed (I don't know what it uses either). I tried pyodbc (the classic execute and executemany, but they're too slow even with fast_executemany). I also used bcpandas (the equivalent of bulk inserts, at ~5K rows/s), but performance suffers there too because I have to convert my data to a DataFrame, which takes about half of the overall time. I also tried printing the data and running the script from SSIS to use its inserts, but Python's print isn't meant for that and is too slow. Finally, I tried writing a new file for SSIS to read, but that takes as much time as an executemany.
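
For reference, a minimal sketch of the pyodbc path I tested (the insert statement, table, and column names are simplified placeholders):

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};Server=localhost;Database=test;Trusted_Connection=yes;')
cursor = conn.cursor()
# ship parameter batches in bulk instead of one round trip per row
cursor.fast_executemany = True

# insert_params is the list of row tuples built in the loop further below
insert_query = "INSERT INTO table1 (col1, col2, col3) VALUES (?, ?, ?)"
cursor.executemany(insert_query, insert_params)  # still too slow at this volume
conn.commit()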

Is there another way of doing this that I missed? What method does SSIS use to perform its data flow inserts? My DB: MSSQL.

import pyodbc # tested executemany()
import time
import pandas as pd
import bcpandas
import sqlalchemy

# start time
startTime = time.time()

## file parameters
filepath = "C:/"
filename = "file.txt"

alchemy_eng = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=DRIVER={ODBC Driver 17 for SQL Server};Server=localhost;Database=test;Trusted_Connection=yes;')
bcpandas_eng = bcpandas.SqlCreds.from_engine(alchemy_eng)

### process file
counter = 0
insert_params = []

with open(filepath + filename, 'r') as infile:
    for line in infile:
        # temp table partial structure
        col1,col2,col3,.. = line.split('|')
        if col2 == "valueA":
            col2 = None
        # couple more ..
        col3 = col3.replace(',', '.')
        # couple more ..
        # list of elements
        insert_params.append((col1,col2,col3,..))
        counter += 1
        # flush every million rows, not all at once, to avoid a memory crash
        if counter == 1000000:
            df = pd.DataFrame(insert_params)
            df.columns = ['col1',...]
            bcpandas.to_sql(df, 'table1', bcpandas_eng, if_exists="append")
            # reset
            insert_params = []
            counter = 0

# flush the rows left over after the last full batch
if insert_params:
    df = pd.DataFrame(insert_params)
    df.columns = ['col1',...]
    bcpandas.to_sql(df, 'table1', bcpandas_eng, if_exists="append")

# end time
endTime = time.time()
# elapsed time
print("elapsed time", endTime - startTime)
Sleepy
  • Please explain better what you are trying to do and share the code you have tried so far for the insertion – gtomer Dec 14 '22 at 09:14
  • Kind of a lame solution, but what if you process your file, then save it to disk, and then use BULK INSERT (see the sketch after these comments): https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver16 – Atanas Atanasov Dec 14 '22 at 09:41
  • @AtanasAtanasov I tried saving the file, but I waste too much time on just saving – Sleepy Dec 14 '22 at 09:44
  • @gtomer Which solution would you like to see? I tried 3 – Sleepy Dec 14 '22 at 09:46
  • What do you mean by 'inserts to the database'? Adding rows? – gtomer Dec 14 '22 at 09:50
  • Did you see this answer: https://stackoverflow.com/questions/46684359/speed-up-insert-to-sql-server-from-csv-file-without-using-bulk-insert-or-pandas – Atanas Atanasov Dec 14 '22 at 09:53
  • @gtomer I added an example; yes, I am adding new rows to a DB. – Sleepy Dec 14 '22 at 10:06
  • @AtanasAtanasov I'll check the straight file read using pandas – Sleepy Dec 14 '22 at 10:06
  • From the example code you shared, you don't seem to be doing much processing. Just straight bulk insert the file. If you still need to do processing, bulk insert into a temporary table, do the edits there, and then copy into the table you need. – Atanas Atanasov Dec 14 '22 at 10:11
  • @AtanasAtanasov With what would I do the edits to the data? I have simplified the code sample a bit. Would it be faster with a procedure? – Sleepy Dec 14 '22 at 10:36
  • Not sure what edits you have, but generally I would do the edits with SQL queries after the insertion – Atanas Atanasov Dec 14 '22 at 10:47
  • @AtanasAtanasov My problem is in the inserts; the SQL driver is too slow compared to the SSIS data flow (I just lose too much time despite winning on data processing time) – Sleepy Feb 16 '23 at 16:04
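
For context, a minimal sketch of the BULK INSERT route suggested in the comments, run from Python via pyodbc; the staging table name and the pre-processed file path are assumptions (the file must be readable by the SQL Server service account):

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};Server=localhost;Database=test;Trusted_Connection=yes;', autocommit=True)
cursor = conn.cursor()
# one server-side bulk load of the pre-processed, pipe-delimited file
# (dbo.table1_staging and the file name are hypothetical)
cursor.execute(r"""
    BULK INSERT dbo.table1_staging
    FROM 'C:\file_processed.txt'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK)
""")

The edits the script currently does in Python (mapping "valueA" to NULL, replacing decimal commas) would then be done with UPDATE statements on the staging table before copying into the target table, as the comments suggest.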

0 Answers