
I am working with a pandas DataFrame and SQLAlchemy. What I am trying to do is merge multiple identical databases into one. The problem is that the tables may contain duplicates across databases. Another problem is that I have around 30M rows, which makes working with the DataFrame slow (even with chunksize).

I tried to iterate over the whole DataFrame, but that takes way too long just to check for duplicates. So I wanted to use the SQL query INSERT ... ON DUPLICATE KEY ... do something, but the only possibilities are UPDATE and IGNORE. What I am trying to do is:
insert every row -> on duplicate key -> update the row I tried to insert and insert it with a new id
I couldn't find anything in SQLAlchemy for this problem on Google.

What I'm doing:

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame()
# put all the data for one table, but from all databases, into df

engine = create_engine("mysql://xxx:xxx@localhost/{db}".format(db=dbname))

df.to_sql(con=engine, name=tableName, if_exists='append', chunksize=chunksize, index=False)

And the MYSQL error is DUPLICATE VALUE FOR PRIMARY KEY
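(For reference: pandas' to_sql accepts a `method` callable with the signature `(table, conn, keys, data_iter)`, which can be used to emit MySQL's INSERT IGNORE instead of a plain INSERT, so duplicate keys are silently skipped rather than raising this error. This drops the duplicates instead of re-inserting them under a new id, so it is only a partial workaround; `insert_ignore` is a hypothetical helper name, a sketch rather than a definitive implementation.)

```python
def insert_ignore(table, conn, keys, data_iter):
    """to_sql `method` hook: INSERT IGNORE each chunk (MySQL syntax)."""
    # Build "INSERT IGNORE INTO `t` (`c1`, `c2`) VALUES (%s, %s)"
    cols = ", ".join("`{}`".format(k) for k in keys)
    placeholders = ", ".join(["%s"] * len(keys))
    sql = "INSERT IGNORE INTO `{}` ({}) VALUES ({})".format(
        table.name, cols, placeholders)
    # conn is a SQLAlchemy connection; .connection exposes the raw DBAPI one
    conn.connection.cursor().executemany(sql, list(data_iter))

# Usage (sketch):
# df.to_sql(con=engine, name=tableName, if_exists='append',
#           chunksize=chunksize, index=False, method=insert_ignore)
```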

EDIT: Adding the table schema

table1 = Table('table1', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    mysql_engine='InnoDB'
    )

table2 = Table('table2', metadata,
    Column('id', VARCHAR(40), primary_key=True, nullable=False),
    Column('id_of', VARCHAR(20), ForeignKey("table1.id"), nullable=False, index=True)
    )

table3 = Table('table3', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True),
    Column('id_produit_enfant', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table4 = Table('table4', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table5 = Table('table5', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )

table6 = Table('table6', metadata,
    Column('index', BIGINT(10), primary_key=True, nullable=False, autoincrement=True),
    Column('id', VARCHAR(40), nullable=False),
    Column('id_produit', VARCHAR(40), ForeignKey("table2.id"), nullable=False, index=True)
    )
Hanggy
  • When you do an INSERT ... ON DUPLICATE KEY ... UPDATE, it does what you say you need: it inserts every row and, if it finds a duplicate key, updates the values you need – nacho May 21 '19 at 08:01
  • http://www.mysqltutorial.org/mysql-insert-or-update-on-duplicate-key-update/ This does not say the same as you – Hanggy May 21 '19 at 08:22
  • That tutorial says exactly the same as I do. Maybe you didn't understand it. The system tries to insert a row. If the primary key does not exist, it inserts the row and continues. If the primary key exists, it updates the columns you specify after the UPDATE and then continues. https://dev.mysql.com/doc/refman/5.5/en/insert-on-duplicate.html – nacho May 21 '19 at 08:30
  • `If you specify an ON DUPLICATE KEY UPDATE clause and a row to be inserted would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row occurs.` That is the first line of your link. I don't want to update the old row but to add a new one with a modified key – Hanggy May 21 '19 at 08:38
  • That's not what you said in your question. You talked about updating the row, not updating the primary key and inserting it as a new row – nacho May 21 '19 at 08:42
  • Yeah, I updated the question so that it is clearer. I meant update the row I was actually inserting; sorry, that was confusing – Hanggy May 21 '19 at 08:43
  • What you can do is iterate over the dataframe and use a normal INSERT (inside a try/except). If you get a duplicate it will go to the except block, where you can change the id and try a new INSERT – nacho May 21 '19 at 08:44
  • I used a dataframe because I wanted to avoid inserting rows one by one, by using chunksize – Hanggy May 21 '19 at 08:54
  • Do you think it is possible to update the `INSERT INTO tablename id values(123) ON DUPLICATE KEY UPDATE values(123) = values(1234) ` ? – Hanggy May 21 '19 at 09:10
  • Yes, you can update the PRIMARY KEY by doing INSERT INTO tablename set id=123 ON DUPLICATE KEY UPDATE id=1234. But that will delete the original row and insert a new row with the new id (you won't have two rows, one with id 123 and another one with id 1234, you will only have one row with id 1234) – nacho May 21 '19 at 09:24
  • And do you know if something exists in SQL to do something like that? – Hanggy May 21 '19 at 09:38
  • I think there is nothing to solve it. The only way to do it is what I told you before: iterate over the dataframe and do an INSERT; if you get a duplicate, modify your id (in Python) and try again – nacho May 21 '19 at 09:42
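The row-by-row fallback from the last comment could be sketched like this. SQLite is used here only to keep the example self-contained; with MySQL through SQLAlchemy the duplicate surfaces the same way, as an IntegrityError. The UUID-based id regeneration is an assumed scheme, not something from the question.

```python
import sqlite3
import uuid

# In-memory stand-in for the MySQL table1 (VARCHAR(40) primary key)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id VARCHAR(40) PRIMARY KEY)")

def insert_with_new_id_on_duplicate(conn, row_id):
    """Try a plain INSERT; on a duplicate key, retry under a fresh id.

    Returns the id that was actually stored."""
    try:
        conn.execute("INSERT INTO table1 (id) VALUES (?)", (row_id,))
        return row_id
    except sqlite3.IntegrityError:
        new_id = str(uuid.uuid4())  # hypothetical id-regeneration scheme
        conn.execute("INSERT INTO table1 (id) VALUES (?)", (new_id,))
        return new_id

first = insert_with_new_id_on_duplicate(conn, "abc")   # inserts "abc"
second = insert_with_new_id_on_duplicate(conn, "abc")  # duplicate -> fresh id
```

Both rows end up in the table; the second one just carries a regenerated id, which is what the question asks for, at the cost of one round trip per row.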

1 Answer


Your problem is:

DUPLICATE VALUE FOR PRIMARY KEY

This is the index in pandas. Pandas allows duplicate values in the index, but MySQL and other relational databases do not.

A primary key is a special relational database table column (or combination of columns) designated to uniquely identify each record in the table. Its main feature is that it must contain a unique value for each row of data.

So your problem is how to eliminate the duplicates in the pandas index.
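A minimal sketch of that deduplication, assuming the key column is named `id` as in the schemas above (the `val` column and the sample values are made up for illustration):

```python
import pandas as pd

# Rows pulled from two identical databases; "b" exists in both
df_a = pd.DataFrame({"id": ["a", "b"], "val": [1, 2]})
df_b = pd.DataFrame({"id": ["b", "c"], "val": [2, 3]})

# Combine, then keep only the first occurrence of each key
merged = pd.concat([df_a, df_b], ignore_index=True)
deduped = merged.drop_duplicates(subset="id", keep="first")
# deduped now has one row per id: a, b, c
```

After this, `deduped.to_sql(..., if_exists='append', index=False)` no longer trips over the primary key, as long as the target table does not already contain any of these ids.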

prosti