ETL process with Python and SQL Server taking a really long time to load

Question

I'm looking for a technique that will increase the performance of a csv file SQL Server database load process. I've attempted various approaches but nothing I do seems to be able to break the 5.5 hour barrier. That's just testing loading a year of data which is about 2 million records. I have 20 years of data to load eventually so loading data for 4 days straight isn't going to work.

The challenge is, the data has to be enriched on load. I have to add some columns because that information isn't native to the file. So far I've tried:

Using petl to append columns to the data and then flush that to the database.
Using pandas to append columns to the data and then flushing the data frame to the database.
Using bulk load to load an intermediary staging table and then using T-SQL to populate the extra columns and then pushing that on to a final staging table.

Bulk load works REALLY fast but then I have to add the data for the extra columns and we're back to row level operations which I think is the bottleneck here. I'm getting ready to try:

Appending the data with Pandas.
Writing the data back out to a CSV.
Bulk loading the CSV.

This bothers me because I now have two I/O operations. Read the file into pandas and write the file back out again.

I read somewhere that Pandas was written in C or something so it should be really fast. Flushing a dataframe to the database wasn't that fast. At this point, I'm asking if anybody has a faster approach that they use in the real world. So far what i have is below:

import pypyodbc
conn_str = "DSN=[dsn name];"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = "BULK INSERT pre_stage_view FROM '[file path]' WITH (FIELDTERMINATOR = ',',ROWTERMINATOR = '\n')"
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()

This is the stored procedure get rid of headers:

DELETE FROM pre_stage_table WHERE Symbol = 'Symbol'


INSERT INTO stage_table(
[Symbol],
[Exchange],
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
[SourceSystem],
[RunDate]
)
SELECT
[Symbol],
@exchange, --passed in proc parameter
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
'EODData',
CURRENT_TIMESTAMP
FROM pre_stage_table


TRUNCATE TABLE pre_stage_table

Lukasz Szozda · Accepted Answer · 2017-10-20T19:23:27.760

2

Bulk load works REALLY fast but then I have to add the data for the extra columns and we're back to row level operations which I think is the bottleneck here.

Sorry but I do not understand why you have row level operations. Try:

1) bulk load to stage table

2) MERGE stage table with target table

You will still get set-based approach with presumably decent performance. Remember to disable triggers (if possible on target) plus you may drop indexes, load data and rebuild them after.

edited Oct 20 '17 at 19:23

answered Oct 20 '17 at 19:05

Lukasz Szozda

162,964
23
234
275

If I understand it right, when I flush from pre stage to stage, I'm doing a simple insert which inserts rows line by line. Let me try your suggestion. – Bob Wakefield Oct 20 '17 at 19:28
2

From five hours down to three minutes!! HAHAHAH!! NICE! FTW! Thanks! Never would have thought of using MERGE in that particular use case. – Bob Wakefield Oct 20 '17 at 22:12
@BobWakefield Excellent. I am really glad I could help :) – Lukasz Szozda Oct 21 '17 at 14:26
hi lad2025, I have the exact scenario as above, but in addition to insertion, i also need to update certain columns where applicable, and then push the data to my RDS Mysql. I need to perform upsert in my pyspark & staging table. can you please provide some help. Thanks – Yuva May 31 '18 at 04:55
@YuvaKumar MySQL does not have MERGE but has `ON DUPLICATE KEY` [How to Perform an UPSERT so that I can use both new and old values in update part](https://stackoverflow.com/questions/6107752/how-to-perform-an-upsert-so-that-i-can-use-both-new-and-old-values-in-update-par#6108484) – Lukasz Szozda May 31 '18 at 07:05
@YuvaKumar Sorry, but I've never used pyspark so I cannot confirm it. Feel free to ask new question with your case. I am sure you will get proper answer. – Lukasz Szozda May 31 '18 at 10:46

ETL process with Python and SQL Server taking a really long time to load

1 Answers1