I got two data frames. One contains the main data (called dtt_main
) and could be huge, the other one (called dtt_selected
) contains only two columns, which are also available in the main data frame. For every entry in the dtt_selected
, I want to check whether the same values are included in dtt_main
. If so, this row(s) should be removed (these values are not unique in the dtt_main
, so one could remove multiple rows by applying this criterion). I managed to write a small function which does exactly this, but is really slow because I have to iterate over both dataframes simultaneously. I would be very happy about a faster, more pandas-like solution. Thanks!
# The real data set contains ~100_000 rows and ~1000 columns
dtt_main = pd.DataFrame({
'a': [1,1,1,2,2,4,5,4],
'b': [1,1,2,2,3,3,4,6],
'data': list('abcdefgh')
})
dtt_selected = pd.DataFrame({
'a': [1,1,2,4],
'b': [1,5,3,6]
})
def remove_selected(dtt_main, dtt_selected):
for row_select in dtt_select.itertuples():
for row_main in dtt_main.itertuples():
# First entry of the tuples is the index!
if (row_select[1] == row_main[1]) & (row_select[2] == row_main[2]):
dtt_main.drop(row_main[0], axis='rows', inplace=True)
remove_selected(dtt_main, dtt_selected)
print(dtt_main)
>>> a b data
>>> 2 1 2 c
>>> 3 2 2 d
>>> 5 4 3 f
>>> 6 5 4 g