
I am working with a pandas DataFrame and SQLAlchemy. What I am trying to do is merge multiple identical databases into one. The problem is that the tables may contain duplicates across databases. Another problem is that I have around 30M rows, which makes working with the DataFrame slow (even with chunksize).

I tried to iterate over the whole DataFrame, but that takes way too long just to check for duplicates. So I wanted to use the SQL query INSERT ... ON DUPLICATE KEY ... do something, but the only possibilities are UPDATE and IGNORE. What I am trying to do is:
insert every row -> on duplicate key -> update the row I tried to insert and insert it with a new id
I couldn't find anything in SQLAlchemy for this problem on Google.

What I'm doing:

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame()
# put all the data for one table, but from all databases, into df

engine = create_engine("mysql://xxx:xxx@localhost/{db}".format(db=dbname))

df.to_sql(con=engine, name=tableName, if_exists='append', chunksize=chunksize, index=False)

And the MYSQL error is DUPLICATE VALUE FOR PRIMARY KEY
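(For reference: pandas' to_sql accepts a `method` callable with the signature `(table, conn, keys, data_iter)`, which can be used to emit MySQL's INSERT IGNORE instead of a plain INSERT, so duplicate keys are silently skipped rather than raising this error. This drops the duplicates instead of re-inserting them under a new id, so it is only a partial workaround; `insert_ignore` is a hypothetical helper name, a sketch rather than a definitive implementation.)

```python
def insert_ignore(table, conn, keys, data_iter):
    """to_sql `method` hook: INSERT IGNORE each chunk (MySQL syntax)."""
    # Build "INSERT IGNORE INTO `t` (`c1`, `c2`) VALUES (%s, %s)"
    cols = ", ".join("`{}`".format(k) for k in keys)
    placeholders = ", ".join(["%s"] * len(keys))
    sql = "INSERT IGNORE INTO `{}` ({}) VALUES ({})".format(
        table.name, cols, placeholders)
    # conn is a SQLAlchemy connection; .connection exposes the raw DBAPI one
    conn.connection.cursor().executemany(sql, list(data_iter))

# Usage (sketch):
# df.to_sql(con=engine, name=tableName, if_exists='append',
#           chunksize=chunksize, index=False, method=insert_ignore)
```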

EDIT: Adding the table schema

table1 = Table('table1', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    mysql_engine='InnoDB'
    )

table2 = Table('table2', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    Column('id_of', VARCHAR(20), ForeignKey("table1.id"), nullable=False, index=True)
    )

table3 = Table('table3', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True),
    Column('id_produit_enfant', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table4 = Table('table4', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table5 = Table('table5', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table6 = Table('table6', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )
Hanggy
  • When you do an INSERT ... ON DUPLICATE KEY ... UPDATE, it does what you say you need: it inserts every row and, if it finds a duplicate key, updates the values you need – nacho May 21 '19 at 08:01
  • http://www.mysqltutorial.org/mysql-insert-or-update-on-duplicate-key-update/ This does not say the same as you – Hanggy May 21 '19 at 08:22
  • That tutorial says exactly the same as I do. Maybe you didn't understand it. The system tries to insert a row. If the primary key does not exist, it inserts the row and continues. If the primary key exists, it updates the columns you specify after the UPDATE and then continues. https://dev.mysql.com/doc/refman/5.5/en/insert-on-duplicate.html – nacho May 21 '19 at 08:30
  • `If you specify an ON DUPLICATE KEY UPDATE clause and a row to be inserted would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row occurs.` That is the first line of your link. I don't want to update the old row but to add a new one with a modified key – Hanggy May 21 '19 at 08:38
  • That's not what you said in your question. You talked about updating the row, not updating the primary key and inserting it as a new row – nacho May 21 '19 at 08:42
  • Yeah, I updated the question so that it is clearer. I meant update the row I was actually inserting; sorry, that was confusing – Hanggy May 21 '19 at 08:43
  • What you can do is iterate over the dataframe and use a normal INSERT (inside a try/except). If you get a duplicate it will go to the except block, where you can change the id and try a new INSERT – nacho May 21 '19 at 08:44
  • I used a dataframe because I wanted to avoid inserting rows one by one, by using chunksize – Hanggy May 21 '19 at 08:54
  • Do you think it is possible to update the `INSERT INTO tablename id values(123) ON DUPLICATE KEY UPDATE values(123) = values(1234) ` ? – Hanggy May 21 '19 at 09:10
  • Yes, you can update the PRIMARY KEY by doing INSERT INTO tablename set id=123 ON DUPLICATE KEY UPDATE id=1234. But that will delete the original row and insert a new row with the new id (you won't have two rows, one with id 123 and another one with id 1234, you will only have one row with id 1234) – nacho May 21 '19 at 09:24
  • And do you know if something exists in SQL to do something like that? – Hanggy May 21 '19 at 09:38
  • I think there is nothing to solve it. The only way to do it is what I told you before: iterate over the dataframe and do an INSERT; if you get a duplicate, modify your id (in Python) and try again – nacho May 21 '19 at 09:42
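The row-by-row fallback from the last comment could be sketched like this. SQLite is used here only to keep the example self-contained; with MySQL through SQLAlchemy the duplicate surfaces the same way, as an IntegrityError. The UUID-based id regeneration is an assumed scheme, not something from the question.

```python
import sqlite3
import uuid

# In-memory stand-in for the MySQL table1 (VARCHAR(40) primary key)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id VARCHAR(40) PRIMARY KEY)")

def insert_with_new_id_on_duplicate(conn, row_id):
    """Try a plain INSERT; on a duplicate key, retry under a fresh id.

    Returns the id that was actually stored."""
    try:
        conn.execute("INSERT INTO table1 (id) VALUES (?)", (row_id,))
        return row_id
    except sqlite3.IntegrityError:
        new_id = str(uuid.uuid4())  # hypothetical id-regeneration scheme
        conn.execute("INSERT INTO table1 (id) VALUES (?)", (new_id,))
        return new_id

first = insert_with_new_id_on_duplicate(conn, "abc")   # inserts "abc"
second = insert_with_new_id_on_duplicate(conn, "abc")  # duplicate -> fresh id
```

Both rows end up in the table; the second one just carries a regenerated id, which is what the question asks for, at the cost of one round trip per row.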

1 Answer


Your problem is:

DUPLICATE VALUE FOR PRIMARY KEY

This is the index in pandas. Pandas allows duplicate values in the index, but MySQL and other relational databases do not.

A primary key is a special relational database table column (or combination of columns) designated to uniquely identify each record in the table. Its main feature is that it must contain a unique value for each row of data.

So your problem is how to eliminate the duplicates in the pandas index.
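A minimal sketch of that deduplication, assuming the key column is named `id` as in the schemas above (the `val` column and the sample values are made up for illustration):

```python
import pandas as pd

# Rows pulled from two identical databases; "b" exists in both
df_a = pd.DataFrame({"id": ["a", "b"], "val": [1, 2]})
df_b = pd.DataFrame({"id": ["b", "c"], "val": [2, 3]})

# Combine, then keep only the first occurrence of each key
merged = pd.concat([df_a, df_b], ignore_index=True)
deduped = merged.drop_duplicates(subset="id", keep="first")
# deduped now has one row per id: a, b, c
```

After this, `deduped.to_sql(..., if_exists='append', index=False)` no longer trips over the primary key, as long as the target table does not already contain any of these ids.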

prosti