
I searched for this and found a lot of answers, but all of them are for the general case (without meaningful indices). Maybe "meaningful index" isn't the right name and "representative index" is better: like UUIDs, which represent the whole row and are the only data we need from it. Anyway, I managed to get my work done like this:

import pandas as pd

prev = pd.DataFrame(
    {'name': ['first_old_name', 'second_old_name', 'third_old_name',
              'fourth_old_name', 'fifth_old_name'],
     'fetch_date': ['20230122', '20230123', '20230122', '20230123', '20230123']},
    index=['917857106093847', '1050751214677134', '2589887561569709',
           '3542690854557886', '3772339654185462'])

update = pd.DataFrame(
    {'name': ['second_old_name', 'first_new_name', 'third_old_name',
              'second_new_name'],
     'fetch_date': ['20230123', '20230121', '20230122', '20230123']},
    index=['1050751214677134', '3542690854456286', '2589887561569709',
           '3772339112185462'])

common_filter = update.index.isin(prev.index)  # True where the UUID already exists in prev
print(update[~common_filter])  # keep only the genuinely new rows

# Result:
#                              name fetch_date
# 3542690854456286   first_new_name   20230121
# 3772339112185462  second_new_name   20230123
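
For context, the same "rows of `update` whose index label is absent from `prev`" anti-join can also be expressed directly as a set difference of the two indices; this is just an equivalent formulation of what the code above computes, shown here on a trimmed-down version of the same frames:

```python
import pandas as pd

# Trimmed-down versions of the frames above; UUID-like strings as the index.
prev = pd.DataFrame(
    {'name': ['second_old_name', 'third_old_name'],
     'fetch_date': ['20230123', '20230122']},
    index=['1050751214677134', '2589887561569709'])
update = pd.DataFrame(
    {'name': ['second_old_name', 'first_new_name'],
     'fetch_date': ['20230123', '20230121']},
    index=['1050751214677134', '3542690854456286'])

# Anti-join on the index: keep rows of `update` whose label
# does not appear in `prev` (set difference of the two indices).
new_rows = update.loc[update.index.difference(prev.index)]
print(new_rows)
```

Note that `Index.difference` returns its result sorted, whereas the boolean-mask version preserves the original row order of `update`.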

This seems OK, but since I'm kinda new to the Python world, I'm always curious:

Q1: Is there a better way to do it, both performance-wise and readability-wise (PEP 8 conventions)?

Q2: Why do some of the answers, like this one, use apply(tuple, 1) before calling isin(), instead of applying isin() directly on the DataFrames? Noting that apply(tuple, 1) is more expensive than df1.isin(df2) alone, is there something I'm not seeing?
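
To make Q2 concrete, my understanding is that the two calls don't even compute the same thing: DataFrame.isin(other_df) matches element-wise and aligns on both index and column labels, while apply(tuple, 1).isin(...) tests whole-row membership by value, ignoring labels. A tiny example (toy frames, not the data above) showing them disagree:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['i', 'j'])
b = pd.DataFrame({'x': [2, 1]}, index=['k', 'l'])

# Element-wise with label alignment: no index labels are shared,
# so no cell of `a` is considered "in" `b`.
print(a.isin(b))  # all False

# Row-as-tuple membership: labels are ignored, row values compared,
# so every row of `a` is found in `b`.
print(a.apply(tuple, axis=1).isin(b.apply(tuple, axis=1)))  # all True
```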

%%timeit
update.isin(prev)
# 539 µs ± 167 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
update.apply(tuple, 1).isin(prev.apply(tuple, 1))
# 1.51 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
update.index.isin(prev.index)
# 60.3 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
