
I need to insert a lot of data into a PostgreSQL database. The data is stored in a CSV file.

I've created a Python script to insert the data (it can't be inserted directly because I need to manipulate the data first).

At some point I have something like this:

import psycopg2

connection = psycopg2.connect(connection data...)
cursor = connection.cursor()

for row in rows:
  # calculate x, y, z from row, then...
  try:
    query = "insert into public.my_table (x, y, z) values (" + x + ", " + y + ", " + z + ");"
    cursor.execute(query)
  except:
    manage_error()
  finally:
    connection.commit()

This code works. The problem is that it commits after every insert, so I want to speed things up by committing only once, after the entire CSV is loaded (many thousands of rows).

I've tried:

connection = psycopg2.connect(connection data...)
cursor = connection.cursor()

for row in rows:
  # calculate x, y, z from row, then...
  try:
    query = "insert into public.my_table (x, y, z) values (" + x + ", " + y + ", " + z + ");"
    cursor.execute(query)
  except:
    manage_error()

# Commit once after inserting all rows
connection.commit()

It almost works.

The problem is that some rows in the CSV files contain errors that cause a duplicate key violation, so the exception is raised. Even if I continue to the next row, PostgreSQL aborts the entire transaction, so in the end no rows from the file are committed at all.
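
For example, this is the behaviour I see, reduced to a minimal example (I'm assuming here that x is the primary key of public.my_table, and the connection string is just a placeholder):

import psycopg2

connection = psycopg2.connect("connection data...")
cursor = connection.cursor()

cursor.execute("insert into public.my_table (x, y, z) values (1, 2, 3);")
try:
  # same key again -> psycopg2 raises a UniqueViolation error
  cursor.execute("insert into public.my_table (x, y, z) values (1, 2, 3);")
except psycopg2.Error as error:
  print(error)  # duplicate key value violates unique constraint ...

# the transaction is now aborted: even this perfectly valid insert fails with
# "current transaction is aborted, commands ignored until end of transaction block"
cursor.execute("insert into public.my_table (x, y, z) values (4, 5, 6);")
connection.commit()  # never reached, so nothing from the file gets saved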

I need a way to keep using a single transaction but ignore errors inside it, so I can skip the bad rows (after handling them in the exception) and still import all the valid lines of the CSV.
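
Something along these lines is what I have in mind, wrapping every insert in a SAVEPOINT so that a failed row does not poison the rest of the transaction. This is only a sketch mirroring my loop above, and I don't know whether it is the right approach or fast enough for files of this size:

import psycopg2

connection = psycopg2.connect("connection data...")  # same connection data as above
cursor = connection.cursor()

for row in rows:
  # calculate x, y, z from row, then...
  cursor.execute("savepoint before_insert;")
  try:
    cursor.execute("insert into public.my_table (x, y, z) values (%s, %s, %s);", (x, y, z))
  except psycopg2.Error:
    # undo only this row; the rest of the transaction stays usable
    cursor.execute("rollback to savepoint before_insert;")
    manage_error()
  else:
    cursor.execute("release savepoint before_insert;")

# single commit at the end
connection.commit()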

I have about 20 bad rows in CSV files of about 800,000 rows, so aborting the whole transaction because of so few rows is a pain, and with that many rows you can imagine that committing them one at a time is quite slow. Keep in mind also that I have a very large number of these files to import.

I've seen the ON_ERROR_ROLLBACK feature, which would solve my problem, but unfortunately I cannot use it: it's a feature of the psql client, and I'm not using psql, I'm using a Python script with the psycopg2 library.

So, how can I ignore errors inside a transaction without ON_ERROR_ROLLBACK, since I cannot use it?

Jepessen
  • `Insert ... ON CONFLICT DO NOTHING` might help. Available since v9.5: https://www.postgresql.org/docs/9.5/sql-insert.html – Maurice Meyer Oct 28 '19 at 15:13
  • It's likely better to not use a loop at all. Copy from your csv file into a temporary table using copy_from or copy_expert, remove the duplicate rows from the temp table, then insert into your actual table from the temporary table. – Jeremy Oct 29 '19 at 07:50
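
For reference, a minimal sketch combining the two suggestions from the comments above (assuming PostgreSQL 9.5+ for ON CONFLICT; the temporary table name, the column types and the file name are placeholders, and this skips the per-row manipulation the script currently does in Python):

import psycopg2

connection = psycopg2.connect("connection data...")
cursor = connection.cursor()

# load the whole CSV into a temporary table with COPY (fast, no per-row inserts)
cursor.execute("create temporary table tmp_my_table (x integer, y integer, z integer);")
with open("data.csv") as csv_file:
  cursor.copy_expert("copy tmp_my_table (x, y, z) from stdin with csv", csv_file)

# move the rows into the real table, silently skipping duplicate keys
cursor.execute("""
  insert into public.my_table (x, y, z)
  select x, y, z from tmp_my_table
  on conflict do nothing;
""")
connection.commit()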
