I have a data frame of shape 152431 x 15, and I want the difference between two such frames:


# df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

# df2:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange
emreyuksel

1 Answer

If your dataframes are stored in two files, I would read the lines of each file and build a list of the differences:

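old_file_path = 'INSERT_FILE_PATH_OF_FILE_A'
new_file_path = 'INSERT_FILE_PATH_OF_FILE_B'

# Read both files fully into memory as lists of lines.
with open(old_file_path, 'r', encoding='utf-8') as old, open(new_file_path, 'r', encoding='utf-8') as new:
    fileone = old.readlines()
    filetwo = new.readlines()

# A set makes each membership test O(1) instead of scanning the whole list.
old_lines = set(fileone)

# Keep every line of the new file that does not appear in the old file.
total_of_changes = [line for line in filetwo if line not in old_lines]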
emiljoj
  • Nooooo, please don't do that! Especially when using pandas, there are **far** better options than reading and comparing each file **line-by-line**. With 152k rows, this is absolutely inefficient and furthermore unpythonic and clumsy. – JE_Muc Feb 21 '20 at 12:11
  • Fair enough, a more pythonic approach would help me too. Did you have a specific function in mind? :) – emiljoj Feb 21 '20 at 14:07
  • Yes, Chris A posted a nice solution in his comment: `pd.concat([df1, df2]).drop_duplicates(keep=False)` – JE_Muc Feb 21 '20 at 14:19
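
For anyone who wants to try the one-liner from the last comment, here is a minimal sketch on the sample data from the question. The frames are rebuilt by hand here (the name `extra` is just a helper for this example), so treat it as an illustration rather than a tested solution for the full 152431 x 15 data.

```python
import pandas as pd

# df1 as shown in the question.
df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})

# The two additional rows that only appear in df2 (helper name is hypothetical).
extra = pd.DataFrame({
    "Date": ["2013-11-25"] * 2,
    "Fruit": ["Apple", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Red", "Orange"],
})
df2 = pd.concat([df1, extra], ignore_index=True)

# Stack both frames, then drop every row that occurs more than once.
# keep=False removes all copies of a duplicated row, so only rows that
# are unique across the stacked frame survive.
diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(diff)
```

Two things to keep in mind: this gives a symmetric difference (rows unique to either frame), and a row that is already duplicated inside one frame gets dropped as well, even if the other frame does not contain it. Rows are compared on exact values in all columns, so float columns that differ by rounding will count as different.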