I'm looking for a technique that will increase the performance of a csv file SQL Server database load process. I've attempted various approaches but nothing I do seems to be able to break the 5.5 hour barrier. That's just testing loading a year of data which is about 2 million records. I have 20 years of data to load eventually so loading data for 4 days straight isn't going to work.
The challenge is, the data has to be enriched on load. I have to add some columns because that information isn't native to the file. So far I've tried:
- Using petl to append columns to the data and then flush that to the database.
- Using pandas to append columns to the data and then flushing the data frame to the database.
- Using bulk load to load an intermediary staging table and then using T-SQL to populate the extra columns and then pushing that on to a final staging table.
Bulk load works REALLY fast but then I have to add the data for the extra columns and we're back to row level operations which I think is the bottleneck here. I'm getting ready to try:
- Appending the data with Pandas.
- Writing the data back out to a CSV.
- Bulk loading the CSV.
This bothers me because I now have two I/O operations. Read the file into pandas and write the file back out again.
I read somewhere that Pandas was written in C or something so it should be really fast. Flushing a dataframe to the database wasn't that fast. At this point, I'm asking if anybody has a faster approach that they use in the real world. So far what i have is below:
import pypyodbc
conn_str = "DSN=[dsn name];"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = "BULK INSERT pre_stage_view FROM '[file path]' WITH (FIELDTERMINATOR = ',',ROWTERMINATOR = '\n')"
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
This is the stored procedure get rid of headers:
DELETE FROM pre_stage_table WHERE Symbol = 'Symbol'
INSERT INTO stage_table(
[Symbol],
[Exchange],
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
[SourceSystem],
[RunDate]
)
SELECT
[Symbol],
@exchange, --passed in proc parameter
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
'EODData',
CURRENT_TIMESTAMP
FROM pre_stage_table
TRUNCATE TABLE pre_stage_table