Desire:
I want a way to merge two data frames and keep the non-intersected data from a specified data frame.
Problem:
I have duplicate data and I expected this line to remove that duplicate data:
final_df = new_df[~new_df.isin(previous_df)].dropna()
Example data and data test:
record = Record(1000, 9300815, '<redacted type>', '<redacted id>')
test_df = pd.DataFrame([record])
if not final_df.empty:
# this produces an empty data frame
empty_df = test_df[test_df.isin(final_df)].dropna()
# this produces the record
record_df = final_pdf[final_pdf.col01 == record.col01]
Background:
I'm loading xml data and converting the xml file into several different record types as namedtuples. I split each record type into its own dataframe. I then compare the current set of data from the xml file with the data already loaded into the database by constructing previous_df
as such:
previous_df = pd.read_sql_table(table_name, con=conn, schema=schema, columns=columns)
Columns are dynamically created based on the fields in the named tuple. The database schema is generated using sqlalchemy, and I have added UniqueConstraint
to manage when I think there are duplicates within the database.
Thanks in advance for any help provided.