0

I am building a database from scraped data. I want to only insert records which are not yet in my database and disregard all the duplicate records.

Following is my code

conn = sqlite3.connect('Database.db')
df.to_sql("My_cool_data_Temp", conn, if_exists='replace', index=False)
query = "SELECT * FROM My_cool_data_Temp EXCEPT SELECT * FROM My_cool_data;"        
new_entries = pd.read_sql(query, conn)               
new_entries.to_sql("My_cool_data", conn, if_exists='append', index=False)       
conn.execute("DROP TABLE My_cool_data_Temp;")

This code gives me the following error:

sqlite3.IntegrityError: UNIQUE constraint failed:

Printing out new_entries I saw that for some reason the query doesn't seem to work as I get duplicates.

Any ideas what I might miss? I would be also glad for different approaches to my problem.

Thank you very much!

1 Answers1

0

Without seeing sample data, issue may be unique identifiers on all records. If so, avoid SELECT * FROM and instead explicitly define all columns. Otherwise, any unique identifier will not match to existing records and consequently cause all records in first select to return in EXCEPT. Also, use an insert-select and avoid second pandas call:

conn = sqlite3.connect('Database.db')

df.to_sql("My_cool_data_Temp", conn, if_exists='replace', index=False) 

query = """INSERT INTO My_cool_data (Col1, Col2, Col3, ...)
           SELECT Col1, Col2, Col3, ...
           FROM My_cool_data_Temp
           EXCEPT 
           SELECT Col1, Col2, Col3, ...
           FROM My_cool_data;
        """
conn.execute(query)

conn.execute("DROP TABLE My_cool_data_Temp")
conn.commit()

If columns are so many (which may be indicative of sub-optimal database design), build SQL with DataFrame.columns (removing unique identifiers). Backticks added for reserved words or column names with special characters and/or spaces.

df = df.drop(["unique_id_column"], axis="columns")

cols = "`, `".join(df.columns.tolist())

query = f"""INSERT INTO My_cool_data (`{cols}`)
            SELECT `{cols}`
            FROM My_cool_data_Temp
            EXCEPT 
            SELECT `{cols}`
            FROM My_cool_data;
         """
Parfait
  • 104,375
  • 17
  • 94
  • 125