5

I am trying to insert data from an xlsx file into a MySQL table. I want to insert the data, and if there is a duplicate on the primary key, update the existing row instead. I have already written a script for this, but I realized it is a lot of work, and doing it with pandas would be quicker. How can I achieve this in pandas?

#!/usr/bin/env python3

import pandas as pd
import sqlalchemy

# connection string for the target MySQL database
engine_str = 'mysql+pymysql://admin:mypass@localhost/mydb'
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')

# load the Excel sheet into a DataFrame
file_name = "tmp/results.xlsx"
df = pd.read_excel(file_name)
r0xette

3 Answers

10

I can think of two options, but number 1 might be cleaner/faster:

1) Let SQL decide whether to insert or update. Check this other question. You can iterate over the rows of your df (i = 0 to n-1) and, inside the loop, run an insert like:

query = """INSERT INTO table (id, name, age) VALUES(%s, %s, %s)
ON DUPLICATE KEY UPDATE name=%s, age=%s"""
engine.execute(query, (df.id[i], df.name[i], df.age[i], df.name[i], df.age[i]))

2) Define a Python function that returns True if the record already exists (and False otherwise), then use it in your loop:

def check_existence(user_id):
    query = "SELECT EXISTS (SELECT 1 FROM your_table WHERE user_id_str = %s);"
    return list(engine.execute(query, (user_id,)))[0][0] == 1

You could iterate over the rows and do this check before inserting; a minimal sketch of that loop follows below.
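
For illustration only, a sketch of that loop. The table and column names (your_table, id, name, age) are placeholders, and it assumes check_existence looks up the table's primary key:

for row in df.itertuples(index=False):
    if check_existence(row.id):
        # record exists: update the non-key columns
        engine.execute("UPDATE your_table SET name = %s, age = %s WHERE id = %s",
                       (row.name, row.age, row.id))
    else:
        # record is new: insert it
        engine.execute("INSERT INTO your_table (id, name, age) VALUES (%s, %s, %s)",
                       (row.id, row.name, row.age))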

Please also check the solution in this question, and this one too, which might work in your case.

cd98
  • The reason I am trying to avoid the SQL query itself is that the Excel files I receive often have column values which need to be escaped before I can build a query and insert into the database. Since pandas handles column values very efficiently, I was hoping there is a way to use pandas to insert the data directly rather than executing a SQL query; that would let me bypass escaping the random values (or garbage data). – r0xette Oct 30 '16 at 22:22
  • 1
  • @r0xette I don't get why you can't use an SQL query. Are you sure you can't change the column names or something? In any case, maybe using `sqlalchemy` will let you do this while avoiding explicit SQL string queries (see the sketch after these comments). AFAIK, pandas doesn't have a native way of inserting vs. updating rows. – cd98 Nov 01 '16 at 18:44
  • I am using a SQL query now, escaping the double quotes etc. as they come. What I was hoping for was a way to perform the upsert natively so the bad data would be escaped automatically. – r0xette Nov 01 '16 at 18:56
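
To illustrate cd98's point above: with SQLAlchemy's bound parameters the driver does the quoting, so values from the spreadsheet need no manual escaping. A minimal sketch, assuming placeholder table and column names (your_table, id, name, age):

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://admin:mypass@localhost/mydb')

stmt = text("""INSERT INTO your_table (id, name, age) VALUES (:id, :name, :age)
               ON DUPLICATE KEY UPDATE name = :name, age = :age""")

for row in df.itertuples(index=False):
    # bound parameters are escaped by the driver, so quotes and other
    # odd characters in the Excel data need no manual handling
    engine.execute(stmt, {"id": row.id, "name": row.name, "age": row.age})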
1

Pangres is the tool for this job.

Overview here: https://pypi.org/project/pangres/

Use the function pangres.fix_psycopg2_bad_cols to "clean" the columns in the DataFrame.

Code/usage here: https://github.com/ThibTrip/pangres/wiki and https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres. Example code:

# From: https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres
import pandas as pd
from pangres import fix_psycopg2_bad_cols

# fix bad col/index names with the default replacements
# (empty string for '(', ')' and '%'):
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')
print(df)
# test()  foo()%
#      0       0

# clean cols/index with the default replacements
df_fixed = fix_psycopg2_bad_cols(df)
print(df_fixed)
# test    foo
#    0      0

# fix bad col/index names with custom replacements
# (you MUST provide replacements for '(', ')' and '%'):

# reset df
df = pd.DataFrame({'test()': [0],
                   'foo()%': [0]}).set_index('test()')

# clean cols/index with user-specified replacements
df_fixed = fix_psycopg2_bad_cols(df, replacements={'%': 'percent', '(': '', ')': ''})
print(df_fixed)
# test    foopercent
#    0             0

Note that it only fixes some of the bad characters: it replaces '%', '(' and ')' (characters that won't play nicely, or at all). Still, it is useful in that it handles both the cleanup and the upsert; a sketch of the upsert call follows below.
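
For the upsert itself, a minimal sketch based on the pangres documentation. The table name, key column and connection string are placeholders, and keyword names may differ slightly between pangres versions:

import pandas as pd
from sqlalchemy import create_engine
from pangres import upsert

engine = create_engine('mysql+pymysql://admin:mypass@localhost/mydb')

# pangres uses the DataFrame index as the primary key,
# so set the index to the key column(s) first
df = pd.read_excel("tmp/results.xlsx").set_index('id')

# insert new rows; update existing rows on a primary-key conflict
upsert(engine, df=df, table_name='results', if_row_exists='update')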

(P.S. I know this post is over 4 years old, but it still shows up in Google results when searching for "pangres upsert determine number inserts and updates", as the top SO result, dated May 13, 2020.)

RUDelerius
  • For the code/usage section, could you provide examples from that link in the answer itself? Links are unreliable and may break, but if you provide the relevant pieces of the linked examples, this answer will stand the test of time a bit better. – gallen Dec 18 '20 at 20:00
  • @gallen - code added from source/link as requested – RUDelerius Dec 23 '20 at 16:24
-4

When using pandas, no iteration is needed. Isn't that faster?

df = pd.read_csv(csv_file, sep=';', names=['column'])

df.to_sql('table', con=con, if_exists='append', index=False, chunksize=20000)
wfolkerts
  • 1
  • Using if_exists='replace' here would drop the table, so all data other than your df would be lost from the original table. I think the question is how to keep the non-conflicting rows in the database table and only update the conflicting ones. – Yingbo Miao Jun 18 '20 at 14:16