1

I got two data frames. One contains the main data (called dtt_main) and could be huge, the other one (called dtt_selected) contains only two columns, which are also available in the main data frame. For every entry in the dtt_selected, I want to check whether the same values are included in dtt_main. If so, this row(s) should be removed (these values are not unique in the dtt_main, so one could remove multiple rows by applying this criterion). I managed to write a small function which does exactly this, but is really slow because I have to iterate over both dataframes simultaneously. I would be very happy about a faster, more pandas-like solution. Thanks!

# The real data set contains ~100_000 rows and ~1000 columns
dtt_main = pd.DataFrame({
    'a': [1,1,1,2,2,4,5,4], 
    'b': [1,1,2,2,3,3,4,6], 
    'data': list('abcdefgh')
})

dtt_selected = pd.DataFrame({
    'a': [1,1,2,4], 
    'b': [1,5,3,6]
})

def remove_selected(dtt_main, dtt_selected):
    for row_select in dtt_select.itertuples():
        for row_main in dtt_main.itertuples():
            # First entry of the tuples is the index!
            if (row_select[1] == row_main[1]) & (row_select[2] == row_main[2]):
                dtt_main.drop(row_main[0], axis='rows', inplace=True)

remove_selected(dtt_main, dtt_selected)
print(dtt_main)
>>> a  b data
>>> 2  1  2    c
>>> 3  2  2    d
>>> 5  4  3    f
>>> 6  5  4    g
Timon
  • 343
  • 1
  • 3
  • 6
  • 1
    Possible duplicate of [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) See the section on `LEFT-Excluding JOIN` – ALollz Feb 02 '19 at 18:36

1 Answers1

1

You could left join the DataFrames using pd.merge. By setting indicator=True, it adds a column _merge which will have 'both' if it occurs also in dtt_selected (and therefore should be dropped) and 'left_only' if it was only in dtt_main (and thus should be kept). Now in hte next line, you can first keep only the columns that have 'left_only', and then drop the now unnecessary '_merge'-column:

df1 = dtt_main.merge(dtt_selected, how='left', indicator=True)
df1[df1['_merge'] == 'left_only'].drop(columns='_merge')
#Output
#   a  b data
#2  1  2    c
#3  2  2    d
#5  4  3    f
#6  5  4    g
Jondiedoop
  • 3,303
  • 9
  • 24
  • Thanks for the solution! There is a small mistake in the second line, it should be changed to `df1[df1['_merge'] == 'left_only'].drop('_merge', axis='columns')` – Timon Feb 02 '19 at 19:32