
I have a .csv file with 600 million plus rows that I need to upload into a database. The table will have 3 columns assigned as a composite primary key.

I use pandas to read the file in chunks of 1000 lines.

At each chunk iteration I use the query

INSERT INTO db_name.dbo.table_name("col1", "col2", "col3", "col4")
VALUES (?,?,?,?)

together with

cursor.executemany(query, df.values.tolist())

in pyodbc to upload the data in chunks of 1000 lines.
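
For reference, a minimal sketch of this chunked upload loop as described above (the file path, connection string, and column names are placeholders, not from the original post):

import pandas as pd
import pyodbc

# Placeholder connection string and file path -- adjust to your environment.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=db_name;Trusted_Connection=yes;")
cursor = conn.cursor()

query = ("INSERT INTO db_name.dbo.table_name (col1, col2, col3, col4) "
         "VALUES (?, ?, ?, ?)")

# Read the csv in 1000-row chunks and insert each chunk with executemany.
# A duplicate key in any chunk makes that executemany call fail,
# which is the problem described in this question.
for df in pd.read_csv("data.csv", chunksize=1000):
    cursor.executemany(query, df.values.tolist())
    conn.commit()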

Unfortunately, there are apparently some duplicate rows present. When a duplicate row is encountered, the upload stops with an error from SQL Server.

Question: how can I upload the data so that, whenever a duplicate is encountered, that row is simply skipped and the rest is uploaded? I found some questions and answers about inserting into a table from another table, or from declared variables, but nothing about reading from a file and using the INSERT INTO table (col_names) VALUES () command.

Based on those answers one idea might be:

At each chunk iteration:

  1. Upload to a temp table
  2. Do the insertion from the temp table into the final table
  3. Delete the rows in the temp table

However, with such a large file every second counts, and I was looking for a more efficient answer.

I also tried to deal with the duplicates in Python; however, since the file is too large to fit into memory, I could not find a way to do that.

Question 2: if I were to use BULK INSERT, how would I skip over the duplicates?

Thank you

– ilyas

4 Answers


You can try to use a CTE and an INSERT ... SELECT ... WHERE NOT EXISTS.

WITH cte AS
(
    SELECT ? AS col1,
           ? AS col2,
           ? AS col3,
           ? AS col4
)
INSERT INTO db_name.dbo.table_name
            (col1, col2, col3, col4)
SELECT col1, col2, col3, col4
FROM cte
WHERE NOT EXISTS (SELECT *
                  FROM db_name.dbo.table_name
                  WHERE table_name.col1 = cte.col1
                        AND table_name.col2 = cte.col2
                        AND table_name.col3 = cte.col3
                        AND table_name.col4 = cte.col4);

Possibly drop some of the table_name.col<n> = cte.col<n> comparisons if the corresponding column isn't part of the primary key.

– sticky bit
  • Where are the values to be inserted that are read from the csv file? – ilyas Jul 30 '18 at 00:58
  • @ilyas: In the Python client, like in your current code (The Python code doesn't need to be changed for this, except for the embedded query string of course.). – sticky bit Jul 30 '18 at 10:06
  • I get the following error when I apply your code: `SystemError: returned a result with an error set` – ilyas Jul 30 '18 at 11:22
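
For concreteness, here is a sketch of the client side described in these comments: only the embedded query string changes, and the chunked executemany loop from the question stays the same (untested, and as the last comment shows it may still need debugging on the pyodbc side).

import pandas as pd
import pyodbc

# Placeholder connection -- same assumptions as the sketch under the question.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=db_name;Trusted_Connection=yes;")
cursor = conn.cursor()

# Only the query string changes; the chunked loop from the question is reused as-is.
query = """
WITH cte AS
(
    SELECT ? AS col1, ? AS col2, ? AS col3, ? AS col4
)
INSERT INTO db_name.dbo.table_name (col1, col2, col3, col4)
SELECT col1, col2, col3, col4
FROM cte
WHERE NOT EXISTS (SELECT *
                  FROM db_name.dbo.table_name
                  WHERE table_name.col1 = cte.col1
                        AND table_name.col2 = cte.col2
                        AND table_name.col3 = cte.col3
                        AND table_name.col4 = cte.col4);
"""

for df in pd.read_csv("data.csv", chunksize=1000):
    cursor.executemany(query, df.values.tolist())
    conn.commit()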

I would always load into a temporary load table first, one without any unique or PK constraints on those columns. This way you can always see that the whole file has loaded, which is an invaluable check in any ETL work, and it also makes any other analysis of the source data easy.

After that, use an insert such as the one suggested in an earlier answer, or, if you know that the target table is empty, simply:

INSERT INTO db_name.dbo.table_name(col1,col2,col3,col4)
SELECT distinct col1,col2,col3,col4 from load_table
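
A rough sketch of that two-step flow driven from Python (the load_table name and the final insert come from this answer; the schema qualification, connection details, and chunk loop are assumptions carried over from the question):

import pandas as pd
import pyodbc

# Placeholder connection -- same assumptions as the sketch under the question.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=db_name;Trusted_Connection=yes;")
cursor = conn.cursor()

# Step 1: load every chunk into the unconstrained load table.
load_query = ("INSERT INTO db_name.dbo.load_table (col1, col2, col3, col4) "
              "VALUES (?, ?, ?, ?)")

for df in pd.read_csv("data.csv", chunksize=1000):
    cursor.executemany(load_query, df.values.tolist())
    conn.commit()

# Step 2: a single set-based insert into the (empty) target table,
# removing exact duplicates with DISTINCT on the way.
cursor.execute(
    "INSERT INTO db_name.dbo.table_name (col1, col2, col3, col4) "
    "SELECT DISTINCT col1, col2, col3, col4 FROM db_name.dbo.load_table"
)
conn.commit()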
– TomC

The best approach is to use a temporary table and execute a MERGE-INSERT statement. You can do something like this (not tested):

CREATE TABLE #MyTempTable (col1 VARCHAR(50), col2, col3...);

INSERT INTO #MyTempTable(col1, col2, col3, col4)
VALUES (?,?,?,?)

CREATE CLUSTERED INDEX ix_tempCol1 ON #MyTempTable (col1);

MERGE INTO db_name.dbo.table_name AS TARGET
USING #MyTempTable AS SOURCE ON TARGET.COL1 = SOURCE.COL1 AND TARGET.COL2 = SOURCE.COL2 ...
WHEN NOT MATCHED THEN
    INSERT(col1, col2, col3, col4)
    VALUES(source.col1, source.col2, source.col3, source.col4);

You need to consider the best indexes for your temporary table to make the MERGE faster. The WHEN NOT MATCHED clause is what skips the duplicates: any source row that already matches a target row on the columns compared in the ON clause is left alone, and only unmatched rows are inserted.
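
As a rough, untested sketch, this per-chunk temp-table + MERGE flow could be driven from pyodbc like the following (the three key columns are assumed from the question, and the column types and connection details are placeholders):

import pandas as pd
import pyodbc

# Placeholder connection -- same assumptions as the sketch under the question.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=db_name;Trusted_Connection=yes;")
cursor = conn.cursor()

# Create the temp table once for this connection; column types are placeholders.
cursor.execute("""
    CREATE TABLE #MyTempTable (
        col1 VARCHAR(50), col2 VARCHAR(50), col3 VARCHAR(50), col4 VARCHAR(50)
    );
    CREATE CLUSTERED INDEX ix_tempCol1 ON #MyTempTable (col1);
""")

merge_sql = """
    MERGE INTO db_name.dbo.table_name AS TARGET
    USING #MyTempTable AS SOURCE
       ON TARGET.col1 = SOURCE.col1
      AND TARGET.col2 = SOURCE.col2
      AND TARGET.col3 = SOURCE.col3
    WHEN NOT MATCHED THEN
        INSERT (col1, col2, col3, col4)
        VALUES (SOURCE.col1, SOURCE.col2, SOURCE.col3, SOURCE.col4);
"""

for df in pd.read_csv("data.csv", chunksize=1000):
    # Stage the chunk, merge it into the target, then clear the temp table.
    cursor.executemany(
        "INSERT INTO #MyTempTable (col1, col2, col3, col4) VALUES (?, ?, ?, ?)",
        df.values.tolist(),
    )
    cursor.execute(merge_sql)
    cursor.execute("TRUNCATE TABLE #MyTempTable;")
    conn.commit()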


SQL Server Integration Services offers one method that can read data from a source (via a Data Flow task) and then remove duplicates using its Sort transformation (which has a checkbox to remove duplicate rows).

https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/

Of course the data has to be sorted, and 600 million+ rows isn't going to be fast.


If you want to use pure SQL Server, then you need a staging table (without a PK constraint). After importing your data into Staging, you would insert into your target table while filtering out rows that already exist for the composite PK combination. For example:

Insert into dbo.RealTable (KeyCol1, KeyCol2, KeyCol3, Col4) 
Select Col1, Col2, Col3, Col4 
  from dbo.Staging S 
 where not exists (Select *
                     from dbo.RealTable RT 
                    where RT.KeyCol1 = S.Col1
                      AND RT.KeyCol2 = S.Col2
                      AND RT.KeyCol3 = S.Col3
                  )

In theory you could also use the set operator EXCEPT, since it returns distinct rows from the first query that are not present in the second. For example:

INSERT INTO RealTable
SELECT * FROM Staging
EXCEPT
SELECT * FROM RealTable

This would insert the distinct rows from Staging that don't already exist in RealTable. This method doesn't account for rows that share the same composite PK but carry different values in the non-key columns; in that case you would still get an insert error, which would indicate that different values are being assigned to the same composite key in the csv.

– Zorkolot