
I want to add a dataframe to my DB that contains some duplicate rows, but because my real database is very big, I don't want to drop the existing table and re-add it with the updated rows, as that would require me to pull it again from an API.

I think the correct approach could be to use pd.read_sql() and then compare the result with df2, and then only insert the rows which are not already existing.

import sqlalchemy as db
import pandas as pd

engine = db.create_engine('sqlite:///test.db', echo=True)

# Original data: three months
data1 = {'Month': ['June', 'July', 'August'],
         'Number': [20, 21, 19]}
df = pd.DataFrame(data1)
df.to_sql("ExampleTable", engine, if_exists="append", index=False)

# New data: overlaps with data1 on June, July, and August
data2 = {'Month': ['May', 'June', 'July', 'August', 'September'],
         'Number': [11, 20, 21, 19, 14]}
df2 = pd.DataFrame(data2)
df2.to_sql("ExampleTable", engine, if_exists="append", index=False)
Soma Juice
    _"I would like to append the months which don't exist in data1 from data2, but in correct order according to a calendar."_ You shouldn't rely on storage order, but consider database tables unordered sets. If you need ordering, you should specify an `ORDER BY` clause when querying. – Mark Rotteveel Jan 16 '22 at 19:45
  • We're not supposed to say "me, too", but this is an important point that many today have lost sight of. Rows in an SQL database are UNORDERED. That is by definition. Unless you provide an `ORDER BY` clause, the database engine is free to return rows in whatever order it wants. If you change the table in any way, the ordering might change from query to query. – Tim Roberts Jan 16 '22 at 19:50
  • Very interesting, but the way I made my program, it unfortunately has to be ordered for now. – Soma Juice Jan 16 '22 at 19:50
  • They aren't. That's a FACT. You MUST use an `ORDER BY` clause when you fetch your query. There is no other option. – Tim Roberts Jan 16 '22 at 19:51
  • Oh okay, I think I don't comprehend this stuff fully, I suppose :) – Soma Juice Jan 16 '22 at 19:52
  • Can it be assumed that the original table you want to append data to has already been stored as a dataframe in df1? Seems like you'd want to do this in the database and not pull all that "very big" data into a dataframe. If there are two dfs, one with existing data and one with existing+new data, then you can just get the rows in df2 that don't exist in df1, and append the unique rows in df2 to the table. As other authors have stated, order doesn't matter on write, just on read. Please clarify to help identify a solution – frederick-douglas-pearce Jan 16 '22 at 20:26
  • @frederick-douglas-pearce Yes, it can be assumed that the original table has been stored as a dataframe. So actually what I want to do is to read the DB as a dataframe, then compare it with the new dataframe, and then insert what's not there. – Soma Juice Jan 16 '22 at 21:07
  • Related: https://stackoverflow.com/q/62388767/2144390 – Gord Thompson Jan 16 '22 at 22:21

2 Answers


So just to clarify some things.

  1. SQLite does NOT store rows in order based on your data. It stores them based on ROWID. However, you can retrieve data and have SQLite order the returned rows based on one or more columns in the SQL query (see the first sketch after this list).

  2. You're asking to insert data and then modify or update that data in a second request. Dataframes don't really offer this type of functionality. From what I understand, pandas can either insert into an already existing table or drop the table and insert the new data, so this is something you will most likely have to achieve with raw SQL queries. This post shows someone doing what is called an "upsert" in SQL with pandas, but it required creating a temp table and using that table to modify data in the original table (a rough sketch follows this list): How do I perform an UPDATE of existing rows of a db table using a Pandas DataFrame?
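
For the ordering point, a minimal sketch, assuming the ExampleTable and engine from the question and that Month is stored as text (a plain ORDER BY Month would sort alphabetically, so calendar order is imposed on the client side here):

import pandas as pd
import sqlalchemy as db

engine = db.create_engine('sqlite:///test.db')

# Storage order is not guaranteed, so impose calendar order explicitly
# with an ordered Categorical after reading the table back.
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

result = pd.read_sql("SELECT * FROM ExampleTable", engine)
result['Month'] = pd.Categorical(result['Month'], categories=month_order, ordered=True)
print(result.sort_values('Month').reset_index(drop=True))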

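And a rough sketch of the temp-table "upsert" idea from point 2 (the StagingTable name is my own placeholder, not from the linked post):

import pandas as pd
import sqlalchemy as db

engine = db.create_engine('sqlite:///test.db')

new_rows = pd.DataFrame({'Month': ['May', 'September'],
                         'Number': [11, 14]})

# Stage the incoming rows in a temporary table, then let SQLite insert
# only the months that are not already present in the target table.
new_rows.to_sql("StagingTable", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    conn.execute(db.text("""
        INSERT INTO ExampleTable (Month, Number)
        SELECT s.Month, s.Number
        FROM StagingTable s
        WHERE s.Month NOT IN (SELECT Month FROM ExampleTable)
    """))
    conn.execute(db.text("DROP TABLE StagingTable"))
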
TeddyBearSuicide

Sounds like this problem boils down to finding the rows in a new dataframe, df2, that are not in the original dataframe, df, so that the rows unique to df2 can be appended to an existing SQL table. A similar dataframe comparison scenario has been discussed in Stack Overflow posts here and here. If you only need to compare across the 'Month' column, then this should work:

df2_month_not_in_df = df2[~df2['Month'].isin(df['Month'])]
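
If that single-column check is enough, the filtered rows can then be appended directly, reusing the engine from the question:

df2_month_not_in_df.to_sql("ExampleTable", engine, if_exists="append", index=False)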

But that assumes there are no null 'Month' values to worry about, no need to include other columns like year, etc. A more rigorous approach that compares dataframe rows across multiple columns, possibly with null values, can be implemented following the answer by @toecsnar42 and others in the SO links above:

# Convert every value to a string (missing values become 'nan'), then
# collapse each row into a single tuple so whole rows can be compared.
df_str_tuples = df.astype(str).apply(tuple, axis=1)
df2_str_tuples = df2.astype(str).apply(tuple, axis=1)
# Mask is True for each df2 row that already exists in df.
df2_rows_in_df_filter = df2_str_tuples.isin(df_str_tuples)
df2_rows_not_in_df = df2[~df2_rows_in_df_filter]

For each dataframe, convert all values to a string type, which changes missing values to 'nan', then create a series by combining the values from each column into a single tuple for each row. Next, create a mask that is true whenever df2 has a row that matches a row in df. Finally, filter the df2 dataframe to only the rows where the mask is false (i.e. not in df), which builds the dataframe, df2_rows_not_in_df, that can be appended to the existing SQL table using to_sql with if_exists="append". Other implementation approaches from past posts use pd.merge or pd.concat instead, so there are different options to consider depending on your use case.
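
For reference, a sketch of the pd.merge variant, using merge's indicator flag (my own adaptation, not taken from a specific post above):

# Left-join df2 against df on all shared columns; rows tagged
# 'left_only' exist only in df2 and are safe to append.
merged = df2.merge(df, how='left', indicator=True)
df2_rows_not_in_df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')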