How to remove a subset of a data frame in Python?

Question

My dataframe df is 3020x4. I'd like to remove a subset df1 (20x4) from the original. In other words, I just want the difference, whose shape is 3000x4. I tried the code below, but it did not work: it returned exactly df. Would you please help? Thanks.

new_df = df.drop(df1)

What is this subset? is it a number of index values, specific values etc.? — EdChum, Sep 09 '16 at 09:20
Or are you just wanting to diff the 2 dfs? like `merged = df.merge(df1, indicator=True, how='left')` `merged[merged['_merge'] == 'left_only']` — EdChum, Sep 09 '16 at 09:25

score 17 · Accepted Answer · answered Sep 09 '16 at 09:37

As you seem to be unable to post a representative example I will demonstrate one approach using merge with param indicator=True:

So generate some data:

In [116]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df

Out[116]:
          a         b         c
0 -0.134933 -0.664799 -1.611790
1  1.457741  0.652709 -1.154430
2  0.534560 -0.781352  1.978084
3  0.844243 -0.234208 -2.415347
4 -0.118761 -0.287092  1.179237

Take a subset:

In [118]:
df_subset=df.iloc[2:3]
df_subset

Out[118]:
         a         b         c
2  0.53456 -0.781352  1.978084

Now perform a left merge with param indicator=True. This adds a _merge column which indicates whether each row is left_only, both or right_only (the latter won't appear in this example). We then filter the merged df to show only the left_only rows:

In [121]:
df_new = df.merge(df_subset, how='left', indicator=True)
df_new = df_new[df_new['_merge'] == 'left_only']
df_new

Out[121]:
          a         b         c     _merge
0 -0.134933 -0.664799 -1.611790  left_only
1  1.457741  0.652709 -1.154430  left_only
3  0.844243 -0.234208 -2.415347  left_only
4 -0.118761 -0.287092  1.179237  left_only

Here is the original merged df, before filtering:

In [122]:
df.merge(df_subset, how='left', indicator=True)

Out[122]:
          a         b         c     _merge
0 -0.134933 -0.664799 -1.611790  left_only
1  1.457741  0.652709 -1.154430  left_only
2  0.534560 -0.781352  1.978084       both
3  0.844243 -0.234208 -2.415347  left_only
4 -0.118761 -0.287092  1.179237  left_only
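
The steps above can be wrapped into one small function. This is only a minimal sketch: it assumes both frames share the same columns and that an exact match on every column value is the right test for "same row". The name `remove_rows` is hypothetical, and the `drop_duplicates()` and `drop(columns='_merge')` calls are additions that are not in the answer itself.

```python
import numpy as np
import pandas as pd


def remove_rows(df, df_subset):
    """Return the rows of df with no exact match in df_subset (hypothetical helper)."""
    # Left merge on all shared columns; _merge records where each row was found.
    # De-duplicating the subset first keeps the merge from repeating rows of df.
    merged = df.merge(df_subset.drop_duplicates(), how='left', indicator=True)
    # Keep rows found only in df, then drop the helper column again.
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=list('abc'))
df_subset = df.iloc[2:3]

df_new = remove_rows(df, df_subset)
print(df_new)
```

Note that merge builds a fresh 0..n-1 index for its result, so the original index labels of df are not carried over unless df already had that default index.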

index_to_keep = df.index.symmetric_difference(subset.index);df.loc[index_to_keep, :] — PhilChang, Sep 09 '16 at 09:42
@PhilChang that assumes that the indices, along with their contents, are the same between the larger df and the subset. As the OP hasn't posted any sample data, `merge` is the safer choice here, as it will use the column values — EdChum, Sep 09 '16 at 09:44
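
For the index-based idea discussed in these comments, a minimal sketch might look like the following. It assumes the subset was sliced from df and still carries df's index labels. It uses `Index.difference` rather than `symmetric_difference`: the two give the same result when every subset label exists in df, but `difference` never brings in labels that exist only in the subset. The sample data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': list('vwxyz')})
df_subset = df.iloc[2:3]  # keeps the original index label 2

# Labels that are in df but not in the subset.
index_to_keep = df.index.difference(df_subset.index)
df_new = df.loc[index_to_keep]
print(df_new)
```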

score 15 · Answer 2 · gciriani · 2020-01-27T19:02:55.580

The pandas cheat sheet also suggests the following technique:

adf[~adf.x1.isin(bdf.x1)]

where x1 is the column being compared, and adf is the dataframe from which rows are removed whenever their x1 value also appears in dataframe bdf.
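
As a small illustration of the `isin` filter, here is a sketch with made-up data. Only the names adf, bdf and x1 come from the answer; the other columns and the values are assumptions.

```python
import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C', 'D'], 'x2': [1, 2, 3, 4]})
bdf = pd.DataFrame({'x1': ['B', 'D'], 'x3': [True, False]})

# isin gives a boolean mask that is True where adf.x1 also occurs in bdf.x1;
# ~ inverts the mask, so only the rows without a match are kept.
result = adf[~adf.x1.isin(bdf.x1)]
print(result)
```

This compares a single column only, so rows that agree on x1 but differ elsewhere are removed as well.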

The particular question asked by the OP can also be solved as follows, provided df1 still carries the index labels of the rows it was taken from:

new_df = df.drop(df1.index)
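
A short sketch of this label-based drop, using made-up data. It assumes df has a unique index and that df1 was selected from df without resetting its index. If df1 was built or loaded separately with its own index, the labels will not line up with the intended rows.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40]}, index=list('wxyz'))
df1 = df.loc[['x', 'z']]  # subset that keeps df's labels 'x' and 'z'

# drop works on index labels, not on row contents.
new_df = df.drop(df1.index)
print(new_df)
```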

edited Jan 27 '20 at 19:02

answered Jan 27 '20 at 18:52

