4

Desire:

I want a way to merge two data frames and keep the non-intersected data from a specified data frame.

Problem:

I have duplicate data and I expected this line to remove that duplicate data:

 final_df = new_df[~new_df.isin(previous_df)].dropna()

Example data and data test:

 record = Record(1000, 9300815, '<redacted type>', '<redacted id>')
 test_df = pd.DataFrame([record])
 if not final_df.empty:

      # this produces an empty data frame
      empty_df = test_df[test_df.isin(final_df)].dropna()

      # this produces the record
      record_df = final_pdf[final_pdf.col01 == record.col01]

Background:

I'm loading xml data and converting the xml file into several different record types as namedtuples. I split each record type into its own dataframe. I then compare the current set of data from the xml file with the data already loaded into the database by constructing previous_df as such:

 previous_df = pd.read_sql_table(table_name, con=conn, schema=schema, columns=columns)

Columns are dynamically created based on the fields in the named tuple. The database schema is generated using sqlalchemy, and I have added UniqueConstraint to manage when I think there are duplicates within the database.

Thanks in advance for any help provided.

Brian Bruggeman
  • 5,008
  • 2
  • 36
  • 55
  • Look at [pandas.DataFrame.duplicated](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html#pandas-dataframe-duplicated). if you add some sample data and expected output, you get some detailed help. – Scott Boston Jun 27 '17 at 13:56
  • I've also used `drop_duplicates(keep=False)`, which effectively calls `duplicated` and then removes them in a single line. I really need a good way to merge two data frames and remove their intersection. – Brian Bruggeman Jun 27 '17 at 14:35
  • I would recommand to see [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html) about merging dataframes. I don't see exactly where you want to go, but you can find a solution to your problem there – Thomas Dussaut Jun 27 '17 at 14:46

1 Answers1

5

PRESERVING SINGLE RECORDS FROM BOTH DATAFRAMES:

Try to concat the dataframes first, so you are sure that you will have duplicates. Then apply drop_duplicates and I think you will end up with what you are after. See the example below:

#Create dummy data
df1 = pd.DataFrame(columns=["A","B"],data=[[1,2],[3,4],[5,6]])
print(df1)

   A  B
0  1  2
1  3  4
2  5  6

df2 = pd.DataFrame(columns=["A","B"],data=[[3,4],[5,6],[7,8],[9,10]])
print(df2)

   A   B
0  3   4
1  5   6
2  7   8
3  9  10

#Concatenate dataframes
df = pd.concat([df1,df2],axis=0)
print(df)

   A   B
0  1   2
1  3   4
2  5   6
0  3   4
1  5   6
2  7   8
3  9  10

#Drop duplicates
df = df.drop_duplicates(keep=False)
print(df)

   A   B
0  1   2
2  7   8
3  9  10

PRESERVING SINGLE RECORDS FROM ONE DATAFRAME ONLY:

If you only want to keep the data from the new dataframe, just use a dirty little trick: concat the old dataframe twice, so all old records will fall under drop_duplicates criteria. Like so:

#Concatenate dataframes with old dataframe taken twice!
df = pd.concat([df1,df1,df2],axis=0)

#Now you will only end up with the records from second dataframe
df = df.drop_duplicates(keep=False)
print(df)

   A   B
2  7   8
3  9  10
Marjan Moderc
  • 2,747
  • 23
  • 44