Each day I get a pandas dataframe that has five columns called column1, column2, column3, column4, column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv. In order to do so, I wrote some code that should:

1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data df
3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge_data[merge_data.existing.isnull()] (the complement of the new data and the existing data)
5. Append the new rows to the known_data.csv file
My code looks like this:

existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data,
                      on=['column1', 'column2', 'column3', 'column4', 'column5'],
                      how='left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
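With toy data the logic above does behave as intended; here is a self-contained sketch (the column names are from the question, the values are invented for the demo):

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# Toy stand-ins for known_data.csv and the daily delivery (values invented).
existing_data = pd.DataFrame([[1, 2, 3, 4, 5],
                              [6, 7, 8, 9, 10]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],        # already known
                         [11, 12, 13, 14, 15]],  # genuinely new
                        columns=cols)

# Mark every known row, left-join onto the new delivery, and keep the
# rows whose marker is missing -- those were never seen before.
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data, on=cols, how='left')
complement = merge_data[merge_data.existing.isnull()].drop(columns='existing')

print(complement)  # only the 11..15 row survives the filter
```

So the approach itself is sound when the key columns compare equal; the question is why they don't with the real files.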
Unfortunately, this code does not work as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.

Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion, it would be hugely appreciated!
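For what it's worth, one common alternative (a sketch with invented toy data, not tested against the real files): pandas merge accepts indicator=True, which adds a _merge column labelling each row 'both' or 'left_only', so no sentinel column like 'existing' is needed:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

existing_data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],
                         [6, 7, 8, 9, 10]], columns=cols)

# indicator=True adds a '_merge' column: 'both' for rows already present
# in existing_data, 'left_only' for rows seen for the first time.
merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
complement = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(complement)  # only the 6..10 row is reported as new
```

This avoids mutating existing_data and makes the intent of the filter explicit.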
Edit 2: The problem was that, although the dataframes looked identical, some values were of different types. Somehow the error only showed up when I tried to merge a subset of the new dataframe for which this was the case.
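The failure mode from Edit 2 can be reproduced with toy data (values invented for the demo): a value that comes back from the CSV as the string '1' never matches the integer 1, so the merge flags the row as new. Casting the key columns to one common dtype before merging restores the match:

```python
import pandas as pd

cols = ['column1', 'column2']

# Two frames that print identically but hold different types: the CSV
# round-trip can yield strings where new_data holds integers.
existing_data = pd.DataFrame({'column1': ['1'], 'column2': ['2']}, dtype=object)
new_data = pd.DataFrame({'column1': [1], 'column2': [2]}, dtype=object)

merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
print((merged['_merge'] == 'both').sum())  # 0 -- '1' != 1, so nothing matches

# Normalising the key columns to a single dtype fixes the comparison.
existing_data[cols] = existing_data[cols].astype(str)
new_data[cols] = new_data[cols].astype(str)

merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
print((merged['_merge'] == 'both').sum())  # 1 -- the row now matches
```

Applying the same astype normalisation right after read_csv (or passing an explicit dtype to read_csv) should keep the complement empty when the data really is already known.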