I have a .csv file with 600 million plus rows that I need to upload into a database. The target table has 3 columns assigned as a composite primary key.

I use pandas to read the file in chunks of 1,000 lines, and at each chunk iteration I upload the chunk with pyodbc in Python using:

query = ('INSERT INTO db_name.dbo.table_name ("col1", "col2", "col3", "col4") '
         'VALUES (?, ?, ?, ?)')
cursor.executemany(query, df.values.tolist())
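Putting that together, the upload loop looks roughly like this (a minimal sketch; the connection string, file name and column names are placeholders for my real ones):

import pandas as pd
import pyodbc

conn = pyodbc.connect(CONNECTION_STRING)  # placeholder connection string
cursor = conn.cursor()
# cursor.fast_executemany = True          # optional pyodbc speed-up

query = ('INSERT INTO db_name.dbo.table_name ("col1", "col2", "col3", "col4") '
         'VALUES (?, ?, ?, ?)')

# read the 600M+ row file 1,000 lines at a time and insert each chunk
for df in pd.read_csv('data.csv', chunksize=1000):
    cursor.executemany(query, df.values.tolist())
    conn.commit()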
Unfortunately, there are apparently some duplicate rows in the file. When a duplicate row is encountered, the upload stops with an error from SQL Server.
Question: how can I upload the data so that whenever a duplicate is encountered it just skips that line and uploads the rest, instead of stopping? I found some questions and answers about INSERT INTO a table from another table, or INSERT INTO a table from declared variables, but nothing about reading from a file and using an INSERT INTO table (col_names) VALUES () command.
Based on those answers, one idea might be to do the following at each chunk iteration:
- Upload the chunk to a temp table
- Insert from the temp table into the final table, skipping rows whose keys already exist
- Delete the rows from the temp table
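A rough sketch of that staging approach, reusing the connection above (the staging table and the three key columns are assumptions on my part):

# create an empty staging table with the same columns as the target
cursor.execute('SELECT * INTO #stage FROM db_name.dbo.table_name WHERE 1 = 0;')

insert_stage = ('INSERT INTO #stage ("col1", "col2", "col3", "col4") '
                'VALUES (?, ?, ?, ?)')

# copy rows whose 3-column key is not already in the final table;
# duplicate keys inside the same chunk would still need extra handling
move_query = """
INSERT INTO db_name.dbo.table_name ("col1", "col2", "col3", "col4")
SELECT s.col1, s.col2, s.col3, s.col4
FROM #stage AS s
WHERE NOT EXISTS (
    SELECT 1 FROM db_name.dbo.table_name AS t
    WHERE t.col1 = s.col1 AND t.col2 = s.col2 AND t.col3 = s.col3
);
"""

for df in pd.read_csv('data.csv', chunksize=1000):
    cursor.executemany(insert_stage, df.values.tolist())
    cursor.execute(move_query)
    cursor.execute('DELETE FROM #stage;')
    conn.commit()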
However, with such a large file every second counts, and I am looking for a more efficient answer.
I also tried to deal with the duplicates in Python; however, since the file is too large to fit into memory, I could not find a way to do that.
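For example, deduplicating per chunk is easy but only catches duplicates within a single chunk (reusing query and the connection from above):

for df in pd.read_csv('data.csv', chunksize=1000):
    # removes duplicate keys inside this 1,000-row chunk only; duplicates that
    # fall in different chunks are missed, and keeping all 600M+ keys in a set
    # in memory is not feasible
    df = df.drop_duplicates(subset=['col1', 'col2', 'col3'])
    cursor.executemany(query, df.values.tolist())
    conn.commit()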
Question 2: if I were to use BULK INSERT, how would I skip over the duplicates?
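To clarify what I mean, the bulk insert would be something along these lines (the file path, terminators and staging table name are placeholders I made up), presumably loading into a staging table first since the duplicates would still have to be filtered out somewhere:

bulk_query = r"""
BULK INSERT db_name.dbo.staging_table
FROM 'C:\data\data.csv'            -- path as seen from the SQL Server machine
WITH (
    FIRSTROW = 2,                  -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 100000,
    TABLOCK
);
"""
cursor.execute(bulk_query)
conn.commit()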
Thank you