7

How can I use the df.to_sql(if_exists = 'append') to append ONLY the unique values between the dataframe and the database. In other words, I would like to evaluate the duplicates between the DF and the DB and drop those duplicates before writing to the database.

Is there a parameter for this?

I understand that the parameters if_exists = 'append' and if_exists = 'replace'is for the entire table - not the unique entries.

I am using: 
sqlalchemy

pandas dataframe with the following datatypes: 
    index: datetime.datetime <-- Primary Key
    float
    float
    float
    float
    integer
    string <---  Primary Key
    string<----  Primary Key

I'm stuck on this so your help is much appreciated. -Thanks

Merv Merzoug
  • 1,149
  • 2
  • 19
  • 33
  • You are doing upsert, maybe you can refer to my question.https://stackoverflow.com/questions/51905337/how-to-use-pandas-to-do-upsert-in-sqlalchemy – giser_yugang Sep 05 '18 at 15:55

1 Answers1

19

In pandas, there is no convenient argument in to_sql to append only non-duplicates to a final table. Consider using a staging temp table that pandas always replaces and then run a final append query to migrate temp table records to final table accounting only for unique PK's using the NOT EXISTS clause.

engine = sqlalchemy.create_engine(...)

df.to_sql(name='myTempTable', con=engine, if_exists='replace')

with engine.begin() as cn:
   sql = """INSERT INTO myFinalTable (Col1, Col2, Col3, ...)
            SELECT t.Col1, t.Col2, t.Col3, ...
            FROM myTempTable t
            WHERE NOT EXISTS 
                (SELECT 1 FROM myFinalTable f
                 WHERE t.MatchColumn1 = f.MatchColumn1
                 AND t.MatchColumn2 = f.MatchColumn2)"""

   cn.execute(sql)

This would be an ANSI SQL solution and not restricted to vendor-specific methods like UPSERT and so is compliant in practically all SQL-integrated relational databases.

Parfait
  • 104,375
  • 17
  • 94
  • 125
  • This was a great solution to the problem. Thank you. – Merv Merzoug Sep 06 '18 at 09:54
  • Great! Glad to help. Happy coding! – Parfait Sep 06 '18 at 14:36
  • actually would love to get your follow-up thoughts: it appears that I am still getting some duplicate errors. If my primary keys are, say 'Date', 'Instrument', 'Timeframe'. Then would those be the only columns in the "WHERE NOT EXISTS" argument? My script seems to be crashing on duplicates stil :-/ – Merv Merzoug Sep 06 '18 at 14:38
  • Those three columns will need to be added in the subquery of the `WHERE NOT EXISTS` clause to replace *MatchColumn#*. – Parfait Sep 06 '18 at 14:44
  • do we have to manually create the `myTempTable`? – DirtyBit Dec 09 '20 at 14:46
  • if myTempTable exists, it is replaced. If not it is created. Also, don't forget to commit the connection at the end. – GERMAN RODRIGUEZ May 09 '21 at 01:25