
Each day I get a pandas dataframe that has five columns called column1, column2, column3, column4, column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv. In order to do so, I wrote some code that should

  1. Load the data from known_data.csv as a dataframe called existing_data

  2. Add a new column called 'existing' to the existing_data df

  3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns

  4. Check whether new_data contains new rows by looking at merge_data[merge_data.existing.isnull()] (the complement of the new data and the existing data)

  5. Append the new rows to the known_data.csv file

My code looks like this:

import pandas as pd

existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data,
                      on=['column1', 'column2', 'column3', 'column4', 'column5'],
                      how='left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)

Unfortunately, this code does not function as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are being appended to the file anyways.

Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?

Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion that would be hugely appreciated!
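For what it's worth, pd.merge has an indicator parameter that avoids the helper column entirely: it adds a _merge column marking each row as 'both', 'left_only', or 'right_only'. A minimal sketch (the five-column toy frames are made up for illustration):

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# Toy data: one overlapping row (all 2s) and one genuinely new row (all 3s).
existing_data = pd.DataFrame({c: [1, 2] for c in cols})
new_data = pd.DataFrame({c: [2, 3] for c in cols})

# indicator=True adds a '_merge' column; 'left_only' marks rows not in existing_data.
merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
complement = merged.loc[merged['_merge'] == 'left_only', cols]
print(complement)
```

The 'left_only' filter plays the same role as the existing.isnull() check above, without mutating existing_data.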

Edit2: The problem was that although the dataframes looked identical, there were some values that were of a different type. Somehow this error only showed when I tried to merge a subset of the new dataframe for which this was the case.
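One way to guard against such type mismatches is to cast the key columns of both frames to a common type before merging. A hedged sketch (the toy values are made up; casting everything to str is one pragmatic choice, not the only one):

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# The value 2 stored as the string '2' looks identical but will not match
# the integer 2 in a merge on these columns.
existing_data = pd.DataFrame({c: ['1', '2'] for c in cols})
new_data = pd.DataFrame({c: [2, 3] for c in cols})

# Coerce both frames to one dtype so equal-looking values compare equal.
existing_data[cols] = existing_data[cols].astype(str)
new_data[cols] = new_data[cols].astype(str)

merged = new_data.merge(existing_data.assign(existing='yes'), on=cols, how='left')
complement = merged[merged['existing'].isnull()][cols]
print(complement)
```

Alternatively, passing an explicit dtype= mapping to pd.read_csv when loading known_data.csv keeps the file's columns from being inferred differently from the incoming frame's.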

Titus
  • what does `complement` return, did you check it? – Umar.H Nov 28 '19 at 17:26
    @Datanovice Complement returns rows that are already present in the `known_data.csv` file! I have inspected these rows in the `known_data.csv` file and they appear to be identical – Titus Nov 28 '19 at 17:29
  • it's quite hard to write a solution without seeing your data (input and expected), just a sample. For now have a read of [this](https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe), which should point you in the right direction. – Umar.H Nov 28 '19 at 17:55

1 Answer


I think what you are looking for is a concat operation followed by drop_duplicates.

# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)

# Remove all duplicates:
complement.drop_duplicates(inplace=True, keep=False)

This first creates a dataframe holding all the old and new data, then deletes every duplicate entry in a second step. You can also restrict the duplicate comparison to certain columns via the subset argument!
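A runnable illustration of this approach (the two-column toy frames are made up). One caveat worth noting: keep=False drops every copy of a duplicated row, so the result contains rows unique to either frame, including rows present only in the old data:

```python
import pandas as pd

existing_data = pd.DataFrame({'column1': [1, 2], 'column2': ['a', 'b']})
new_data = pd.DataFrame({'column1': [2, 3], 'column2': ['b', 'c']})

# Concat the two dataframes, then drop all copies of any duplicated row.
complement = pd.concat([existing_data, new_data], ignore_index=True)
complement.drop_duplicates(inplace=True, keep=False)
print(complement)  # keeps (1, 'a') and (3, 'c'); the shared (2, 'b') is gone
```

If only the rows that are new relative to known_data.csv should be appended, a further filter (or the merge-based approach from the question) would still be needed, since existing-only rows survive here too.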

See the documentation here:

concat
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

drop_duplicates
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

julian