
I have an SQL table (table_1) that contains data, and I have a Python script that reads a CSV and creates a dataframe.

I want to compare the dataframe with the SQL table data and then insert the missing data from the dataframe into the SQL table.

I've read the posts "comparing pandas dataframe with sqlite table via sqlquery" and "Compare pandas dataframe columns to sql table dataframe columns", but was not able to make it work.

The table and the dataframe have the exact same columns.

The dataframe is:

import pandas as pd

df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})

and the SQL table (using SQLAlchemy):

userid user income
1      Bob  40000
2      Jane 42000

I'd like to compare the df to the SQL table and insert userid 3 (Alice) with all her details; it's the only row missing between them.
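For anyone reproducing this locally, here is a minimal setup sketch. It uses an in-memory SQLite engine purely as a stand-in for the real mssql+pyodbc connection (an assumption for illustration):

import pandas as pd
from sqlalchemy import create_engine

# in-memory SQLite stands in for the real MSSQL engine (illustration only)
engine = create_engine("sqlite://")

# seed table_1 with the two rows shown above
pd.DataFrame({'userid': [1, 2],
              'user': ['Bob', 'Jane'],
              'income': [40000, 42000]}).to_sql('table_1', engine, index=False)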

  • Add some example data which reproduces your actual problem. See more details [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Erfan Jul 23 '21 at 12:34
  • If a record exists in the SQL table, do you want to update the SQL record, or only insert new ones? – Jason Cook Jul 23 '21 at 12:42
  • @JasonCook, thanks for the reply! For this particular scenario, just insert new ones. – Daniel F. Jul 23 '21 at 12:43
  • Can you share more about how you are connecting? pyodbc possibly? I'm thinking a good approach may be to insert the values into an SQL temp table and then issue another SQL statement to do the comparison & insert. – Jason Cook Jul 23 '21 at 12:51
  • @JasonCook, that is my approach when it comes to this type of comparison, either a staging table or a temp table, but I would like to explore this option as well, as setting up staging tables clutters up the db. Connection-wise it is: conn = create_engine("mssql+pyodbc://user:user@ip:port/db?driver=SQL Server&Trusted_Connection=yes", echo=False) – Daniel F. Jul 23 '21 at 12:54
  • How about this: suppose you have df1 from the CSV and df2 from the database. Merge on all columns, `df_merged = pd.merge(df1, df2, on=df1.columns.tolist(), how='inner')`, then the remaining rows in df1 are the ones you want to upload: `df_to_upload = df1[~df1.index.isin(df_merged.index)]` (a corrected sketch of this idea follows below) – Da Song Jul 23 '21 at 13:44
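A caveat on the comment's sketch: pd.merge returns a result with a fresh RangeIndex, so filtering df1 by df_merged.index only behaves as intended when df1 itself has a default RangeIndex. A safer sketch of the same idea (the df1/df2 names are taken from the comment) uses merge's indicator flag:

import pandas as pd

df1 = pd.DataFrame({'userid': [1, 2, 3],
                    'user': ['Bob', 'Jane', 'Alice'],
                    'income': [40000, 50000, 42000]})   # from the CSV
df2 = pd.DataFrame({'userid': [1, 2],
                    'user': ['Bob', 'Jane'],
                    'income': [40000, 42000]})          # from the database

# indicator=True labels each row 'both' or 'left_only'
merged = df1.merge(df2, on=df1.columns.tolist(), how='left', indicator=True)

# rows present only in the CSV are the ones to upload
df_to_upload = merged.loc[merged['_merge'] == 'left_only', df1.columns.tolist()]

Note that with this example data Jane's income differs between the CSV (50000) and the table (42000), so an all-columns match flags her row as new as well; matching on 'userid' alone avoids re-inserting existing ids.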

3 Answers

0

There is still some information missing to provide a full answer. For example, which database interface do you use (SQLAlchemy, sqlite3)? I assume the userid is unique, and that all new ids should be added?

If you are using SQLAlchemy, you might take a look at pangres, which can insert into and update SQL databases from a pandas dataframe. It does, however, require a column with the UNIQUE property in the database (meaning that every entry in it is unique; you could set the userid column UNIQUE here). This method scales better than loading all data from the database and doing the comparison in Python, because only the CSV data is in memory and the database does the comparison.
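A minimal sketch of that approach, assuming a pangres version whose upsert accepts a con= argument (older releases name it engine=) and assuming userid is the unique key; pangres expects the key as the dataframe's index:

from pangres import upsert

# userid must be the (named) index so pangres can use it as the unique key
upsert(con=engine,                  # your SQLAlchemy engine
       df=df.set_index('userid'),
       table_name='table_1',
       if_row_exists='ignore')      # 'ignore' inserts new rows, leaves existing ones untouched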

If you want to do it all in Python, an option is loading the SQL table into pandas and merging on the userid column:

import pandas as pd
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})

# load the existing table and left-merge on the key column
sqldf = pd.read_sql_query("SELECT * FROM table_1", connection)
df = df.merge(sqldf, how='left', on='userid')

Then you can replace the old table with the new one. EDIT: I saw another answer using merge but keeping only the new values and sending just those to the database. That is cleaner than the code above.
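For completeness, the replace step could look like this sketch, assuming df has first been reduced back to the original userid/user/income columns (the merge above leaves suffixed duplicates like user_x/user_y):

# overwrite table_1; if_exists='replace' drops and recreates the table
df.to_sql('table_1', connection, if_exists='replace', index=False)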

0

Since you are only interested in inserting new records, and you are loading from a CSV, you will already have the data in local memory:

# read current userids
sql = pd.read_sql('SELECT userid FROM table_name', conn)

# keep only userids not in the sql table
df = df[~df['userid'].isin(sql['userid'])]

# insert new records; index=False keeps the dataframe index
# from being written to the table as an extra column
df.to_sql('table_name', conn, if_exists='append', index=False)

Other options would require first loading more data into SQL than needed.

Jason Cook
0

Why not just left join the tables?

conn = ...  # your connection/engine

df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
sql = pd.read_sql("SELECT * FROM table_1", con=conn)

# left join; overlapping columns get _x (df) and _y (sql) suffixes
joined = pd.merge(df, sql, how="left", on="userid")
joined = joined[pd.isna(joined["user_y"])]  # rows with no match in the SQL table
index = joined["userid"].tolist()

The variable index now contains all userids that are in df but not in sql.

To insert into the database:

columns = ("userid", "user", "income")
for i in index:
    # pull this userid's row in column order
    data = tuple(df.loc[df["userid"] == i, list(columns)].values.tolist()[0])
    # parameterized query: formatting a tuple/list into the string is not
    # valid SQL and invites injection ('?' is pyodbc's paramstyle)
    sql_insert = "INSERT INTO table_1 (userid, user, income) VALUES (?, ?, ?)"
    conn.execute(sql_insert, data)
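If conn is actually the SQLAlchemy engine from the comments, a sketch of the same insert as a single executemany round trip (assuming SQLAlchemy 1.4+; text() with a list of parameter dicts is standard):

from sqlalchemy import text

# rows still missing from the table, as a list of dicts
rows = df[df["userid"].isin(index)][["userid", "user", "income"]].to_dict(orient="records")

# begin() opens a transaction and commits on success
with conn.begin() as connection:
    connection.execute(
        text("INSERT INTO table_1 (userid, user, income) "
             "VALUES (:userid, :user, :income)"),
        rows,
    )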
Dharman