I am working on a pandas dataframe and with sqlalchemy . What I am trying to do is to merge multiple identical databases into one. The problem is that the tables may have duplicates between DBs. Another problem is that I have something around 30M rows and this causes time loss when working around the dataframe (even with chunksize).
I tried to iterate all over the dataframe but that's way too long for checking duplicates. So I wanted to use the SQL query INSERT ... ON DUPLICATE KEY ... do something
but the only possibilities are UPDATE
and IGNORE
. What I am trying is to :
->insert every rows -> on duplicate key -> update the row I tried to insert and insert it with new id
I couldn't find anything with sqlalchemy for this problem on google
What i'm doing:
import pandas as pd
df = pd.DataFrame()
#put all datas in the df for one tables but from all databases
engine = create_engine("mysql://xxx:xxx@localhost/{db}".format(db=dbname))
df.to_sql(con=engine, name=tableName, if_exists='append', chunksize=chunksize, index= False)
And the MYSQL error is DUPLICATE VALUE FOR PRIMARY KEY
EDIT: Adding the table schema
table1 = Table('table1', metadata,
Column('id', VARCHAR(40), primary_key=True,nullable=False),
mysql_engine='InnoDB'
)
table2= Table('table2', metadata,
Column('id', VARCHAR(40), primary_key=True,nullable=False),
Column('id_of', VARCHAR(20),ForeignKey("table1.id"), nullable=False, index= True)
)
table3= Table('table3', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True),
Column('id_produit_enfant', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table4= Table('table4', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table5= Table('table5', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)
table6= Table('table6', metadata,
Column('index',BIGINT(10), primary_key=True,nullable=False,autoincrement=True),
Column('id', VARCHAR(40),nullable=False),
Column('id_produit', VARCHAR(40),ForeignKey("table2.id"), nullable=False, index= True)
)