I searched for this and found a lot of answers, but all of them cover the general case (without meaningful indices). Maybe "meaningful index" isn't the right term and "representative index" is better: something like a UUID, which represents the whole row and is the only data we need from it. Anyway, I managed to get my work done like this:
import pandas as pd

prev = pd.DataFrame(
    {'name': ['first_old_name', 'second_old_name', 'third_old_name', 'fourth_old_name', 'fifth_old_name'],
     'fetch_date': ['20230122', '20230123', '20230122', '20230123', '20230123']},
    index=['917857106093847', '1050751214677134', '2589887561569709', '3542690854557886', '3772339654185462'])
update = pd.DataFrame(
    {'name': ['second_old_name', 'first_new_name', 'third_old_name', 'second_new_name'],
     'fetch_date': ['20230123', '20230121', '20230122', '20230123']},
    index=['1050751214677134', '3542690854456286', '2589887561569709', '3772339112185462'])

common_filter = update.index.isin(prev.index)
print(update[~common_filter])
# Result:
# name fetch_date
# 3542690854456286 first_new_name 20230121
# 3772339112185462 second_new_name 20230123
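For reference, the same filter can also be written with `Index.difference` instead of a boolean mask. A minimal sketch with trimmed-down versions of the frames above (just enough rows to show the effect):

```python
import pandas as pd

prev = pd.DataFrame(
    {'name': ['second_old_name'], 'fetch_date': ['20230123']},
    index=['1050751214677134'])
update = pd.DataFrame(
    {'name': ['second_old_name', 'first_new_name'],
     'fetch_date': ['20230123', '20230121']},
    index=['1050751214677134', '3542690854456286'])

# Index.difference returns the labels of update that are absent from prev,
# so .loc on that result keeps only the genuinely new rows.
new_rows = update.loc[update.index.difference(prev.index)]
print(new_rows)
```

This reads as "rows of update whose index is not in prev", which is the same set the `~common_filter` mask selects.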
which seems OK. But since I'm kinda new to the Python world, I'm always curious:
Q1: Is there a better way to do this or not, both performance-wise and readability-wise (PEP 8 conventions)?
Q2: Why do some of the answers, like this one, use apply(tuple, 1) before calling isin(), rather than applying isin() directly on the DataFrames? Noting that apply(tuple, 1) is more expensive than df1.isin(df2) alone, is there something I'm not seeing?
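To show what I mean by the behavioral difference, here is a toy example (df1/df2 are made up, not my real data). DataFrame.isin with another DataFrame matches values element-wise at aligned index/column labels, while apply(tuple, 1) collapses each row into a single tuple so a 1-D isin can test whole-row membership:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [2, 1], 'b': ['x', 'y']})

# Element-wise: isin with a DataFrame compares each cell at the same
# index/column position, so column 'b' reports True even though no
# complete row of df1 appears anywhere in df2.
print(df1.isin(df2))

# Row-wise: turning every row into a tuple lets isin ask
# "does this entire row occur in df2?"
rows_in_df2 = df1.apply(tuple, 1).isin(df2.apply(tuple, 1))
print(rows_in_df2)
```

In this toy case the element-wise result marks all of column 'b' as True, while the tuple-based check correctly reports that neither row of df1 exists as a whole row in df2.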
%%timeit
update.isin(prev)
# result
539 µs ± 167 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
update.apply(tuple, 1).isin(prev.apply(tuple,1))
# result
1.51 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
update.index.isin(prev.index)
# result
60.3 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)