
I am trying to import a CSV file into PostgreSQL using Python. I want to tell Python to only import the rows of the CSV file that are not duplicates (only the first record for each unique builder). I have attached the code I am using to find the duplicates in the CSV file. I am new to programming, so please pardon my inexperience.

import csv
import psycopg2

# Connect to the database (credentials redacted)
database = psycopg2.connect(database="***", user="***")
cursor = database.cursor()

# Recreate the target table from scratch
delete = """DROP TABLE IF EXISTS "Real".Noduplicates"""
print(delete)
cursor.execute(delete)

cursor.execute("""CREATE TABLE "Real".Noduplicates
            (UserName varchar(55),
            LastUpdate timestamp,
            Week date,
            Builder varchar(55),
            Traffic integer
            );""")

print("Table created successfully")

csv_data = csv.reader(open('SampleData2.csv', newline=''))

next(csv_data)          # skip the header row
rows = list(csv_data)   # keep the rows so they can be looped over twice

# First pass: report which builders are duplicates
already_seen = set()

for row in rows:
    builder = row[3]
    if builder in already_seen:
        print('{} is a duplicate builder'.format(builder))
    else:
        print('{} is a new builder'.format(builder))
        already_seen.add(builder)

# Second pass: insert every row (duplicates are currently inserted as well)
for row in rows:
    cursor.execute("""INSERT INTO "Real".Noduplicates (UserName, LastUpdate, Week, Builder, Traffic)
                      VALUES (%s,%s,%s,%s,%s)""",
                   row)

cursor.close()
database.commit()
database.close()

print("CSV Imported")
  • I would change Builder in your table to be unique. This means your database will reject all duplicates of this key. Then just loop through your data, try inserting every record, and catch the exception if it fails. Let the database do the work for you. – Hannu Nov 02 '16 at 16:14
  • @Hannu: Exactly. And with the new UPSERT (`INSERT ... ON CONFLICT ...`), you don't even get exceptions. – Erwin Brandstetter Nov 02 '16 at 17:13

1 Answer


Instead, import all rows into a temp table using COPY (much faster) and then INSERT a distinct set into the target table from there, using SELECT DISTINCT ..., SELECT DISTINCT ON (builder), or aggregate functions. Finding dupes is a forte of any RDBMS.
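For example, a minimal sketch of that route with psycopg2 (the staging-table name, the CSV COPY options, and the connection details are assumptions, not taken from this answer):

import psycopg2

conn = psycopg2.connect(database="***", user="***")
cur = conn.cursor()

# Staging table with the same layout as the target; dropped automatically on commit
cur.execute("""CREATE TEMP TABLE staging
                   (username varchar(55), lastupdate timestamp, week date,
                    builder varchar(55), traffic integer)
               ON COMMIT DROP;""")

# Bulk-load the whole CSV into the staging table (COPY skips the header row)
with open('SampleData2.csv') as f:
    cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# Move only one row per builder into the target table
cur.execute("""INSERT INTO "Real".Noduplicates (UserName, LastUpdate, Week, Builder, Traffic)
               SELECT DISTINCT ON (builder) username, lastupdate, week, builder, traffic
               FROM   staging;""")

conn.commit()
cur.close()
conn.close()

Without an ORDER BY, DISTINCT ON keeps an arbitrary row per builder; add something like ORDER BY builder, lastupdate to control which duplicate survives.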


Or, while inserting rows one by one, the new UPSERT in Postgres 9.5 combined with a UNIQUE index on all 5 columns is all you need.
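A rough sketch of that variant against the question's table (the index name noduplicates_all_uni is made up for illustration): create the unique index once, then let every insert silently skip rows that already exist.

import csv
import psycopg2

conn = psycopg2.connect(database="***", user="***")
cur = conn.cursor()

# One-time: unique index covering all five columns (IF NOT EXISTS needs Postgres 9.5)
cur.execute("""CREATE UNIQUE INDEX IF NOT EXISTS noduplicates_all_uni
               ON "Real".Noduplicates (UserName, LastUpdate, Week, Builder, Traffic);""")

with open('SampleData2.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        cur.execute("""INSERT INTO "Real".Noduplicates
                           (UserName, LastUpdate, Week, Builder, Traffic)
                       VALUES (%s,%s,%s,%s,%s)
                       ON CONFLICT DO NOTHING;  -- skip rows identical on all 5 columns""",
                    row)

conn.commit()
cur.close()
conn.close()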

But note the special role of NULL values, which are never considered equal in SQL: rows with NULL in one of the indexed columns never conflict with each other, so such rows can still be inserted more than once.

If the column builder is all you need to identify dupes, just define that column UNIQUE:

...
builder varchar(55) UNIQUE NOT NULL,
...

and use a simple UPSERT:

INSERT INTO "Real".noduplicates (username, lastupdate, week, builder, traffic)
VALUES (%s,%s,%s,%s,%s)
ON CONFLICT (builder) DO NOTHING;
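To wire this into the Python script, one option (my addition, not part of this answer) is to batch the rows with psycopg2.extras.execute_values instead of inserting them one by one:

import csv
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(database="***", user="***")
cur = conn.cursor()

with open('SampleData2.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)          # skip the header row
    rows = list(reader)

# One round trip per batch; duplicate builders are dropped by the UPSERT
execute_values(cur,
    """INSERT INTO "Real".noduplicates (username, lastupdate, week, builder, traffic)
       VALUES %s
       ON CONFLICT (builder) DO NOTHING""",
    rows)

conn.commit()
cur.close()
conn.close()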

Btw, better not to use CamelCase names in Postgres.

  • Thanks Erwin, isn't there any way I can tell Python to only send the first unique Builder record to the database? I want to get rid of the dupes in Python. – Jake Wagner Nov 02 '16 at 16:17
  • @Pythoner: How would Python know which rows are in the table without looking it up? – Erwin Brandstetter Nov 02 '16 at 16:21
  • The data source is the CSV file; would it be possible to tell Python to only import the unique Builder records and then insert them into PostgreSQL? – Jake Wagner Nov 02 '16 at 16:23
  • @Pythoner: Sure, possible. But you would be trying to re-implement the sophisticated algorithms Postgres uses to identify dupes in a set. – Erwin Brandstetter Nov 02 '16 at 16:25
  • The reason I want to get rid of the duplicates in Python is that I want to automate the process in the future, and with Python this can easily be done; I don't know if any process can be automated in PostgreSQL. – Jake Wagner Nov 02 '16 at 16:27
  • If the "builder" field is unique in your database, all inserts of the same key will fail. If you want to do this in your program for some reason, you need to do the select as indicated in this answer. I would suggest using the powerful tool you have (a relational database) instead of trying to fight it with vigour. If your table is or becomes large, letting the DB handle duplicates will be much more efficient than doing this in Python, as the database uses indexing instead of sequential searches in lists etc. – Hannu Nov 02 '16 at 16:29