
This is more of a conceptual question. I'm building a relational database using Python and the psycopg2 library, and I have a table with over 44 million rows (and growing). I want to inject sanitized rows from a CSV file into the table without injecting duplicate rows; each row has an auto-incrementing unique id from its origin db table.

The current way I'm injecting the data is with the `COPY table(columns...) FROM '/path/to/file'` command, which is working like a charm. This happens after we've sanitized all rows in the CSV file so each value matches the datatype of the appropriate column in the table.
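For reference, here is a minimal sketch of what that COPY-based load looks like driven from psycopg2; the connection string, table name, column list, and file path below are placeholders, not our real names:

```python
import psycopg2

# Placeholder connection details, table name, and columns.
conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur:
    with open("/path/to/file.csv") as f:
        # copy_expert streams the file from the client side using the same
        # COPY syntax, so the file does not need to live on the db server.
        cur.copy_expert(
            "COPY my_table (origin_id, col_a, col_b) FROM STDIN WITH (FORMAT csv)",
            f,
        )
```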

There are a few ideas I have in mind, and one I've already tried, but I want to see what the most efficient option is before implementing anything.

The one I tried ended up being a tremendous burden on the server's CPU and memory, so we decided not to proceed with it. I had written a script that, for each incoming row, queried the db to check whether its unique id already exists in the table (over 44 million rows).

My other idealistic solutions:

  1. Allow injection of duplicates, then create a script to clean up any duplicate rows in the table (a rough sketch of that cleanup follows this list).

  2. Create a temporary table with the data from the csv. Compare the temp table with the existing table, removing any duplicate values from the temp table, then injecting the temp table into the existing table.

  3. Step 2 might be simplified further: instead of comparing the two tables, we just use the INSERT INTO command along with the ON CONFLICT option (see the staging-table sketch after this list).

  4. This one might be more of a stretch of the imagination, and probably pretty unique to our situation. Since we know the unique id field is auto-incrementing, we can set a global variable equal to the largest unique id value currently in the table; then, before sanitizing the data, we check whether each row's unique id value is less than that global variable, and if it is, we throw the row out instead of injecting it. (No longer an option)
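For option 1, the post-load cleanup could be a single DELETE that keeps one physical copy of each duplicated id. This is only a sketch: `my_table` and `origin_id` stand in for our real table and the unique id carried over from the origin db, and on a 44-million-row table it would want an index on `origin_id` to be practical:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur:
    # Self-join on the origin id; ctid distinguishes physical rows, so for
    # each group of duplicates every row except the first one is deleted.
    cur.execute(
        """
        DELETE FROM my_table a
        USING my_table b
        WHERE a.origin_id = b.origin_id
          AND a.ctid > b.ctid
        """
    )
```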
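And for options 2 and 3 together, a rough sketch of the staging-table approach, again with placeholder names, and assuming `my_table` has a unique constraint on `origin_id` so that ON CONFLICT has something to target:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur:
    # Temporary staging table with the same columns as the target table.
    cur.execute("CREATE TEMP TABLE staging (LIKE my_table)")
    with open("/path/to/file.csv") as f:
        cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)
    # Rows whose origin_id already exists in my_table are silently skipped.
    cur.execute(
        """
        INSERT INTO my_table
        SELECT * FROM staging
        ON CONFLICT (origin_id) DO NOTHING
        """
    )
```

If the target table also has its own sequence-backed PK, a variant that filters the staging rows with a `WHERE NOT EXISTS` anti-join instead of ON CONFLICT would avoid consuming sequence values on the rejected rows.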

Kevin G
  • 2) Could be done as an INSERT that only inserts into the final table those rows in the staging table that do not exist in the final table. 3) `INSERT ... ON CONFLICT` is a possible solution, though be aware if you have a PK that is a `sequence/IDENTITY` it will increment on all the rejected rows so you could end up with big gaps in the PK values. – Adrian Klaver Apr 04 '22 at 22:31
  • If it's not important which version of a duplicated row is kept, a temporary table that is a copy of the permanent table, with a unique constraint on the columns which should not be duplicated, would sanitise your data using native SQL. You could then insert the new rows into the permanent table without creating gaps in the primary key sequence. –  Apr 05 '22 at 05:26

0 Answers