
I have a list of values (not known in advance, held in a Python list) that a column in my Pandas DataFrame must not contain in any row.

All recipes on the Web (like this one) show how to do it with only one value to exclude, but I have multiple values to exclude. How do I do it?

Please note that I cannot hardcode the values to exclude in my code.

Thanks!

GDICommander
  • Possible duplicate of [How to implement 'in' and 'not in' for Pandas dataframe](http://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe) – root Jun 29 '16 at 20:15

3 Answers


You can use negated isin() indexing:

In [57]: df
Out[57]:
   a  b  c
0  1  2  2
1  1  7  0
2  3  7  1
3  3  2  7
4  1  3  1
5  3  4  2
6  0  7  1
7  5  4  3
8  6  1  0
9  3  2  0

In [58]: my_list = [1, 7, 8]

In [59]: df.loc[~df.b.isin(my_list)]
Out[59]:
   a  b  c
0  1  2  2
3  3  2  7
4  1  3  1
5  3  4  2
7  5  4  3
9  3  2  0

or use the query() function:

In [60]: df.query('@my_list not in b')
Out[60]:
   a  b  c
0  1  2  2
3  3  2  7
4  1  3  1
5  3  4  2
7  5  4  3
9  3  2  0
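
The `@my_list` prefix tells query() to resolve the name from the surrounding Python scope rather than from the DataFrame's columns. The same condition can also be written with the column name first, which some find more readable (a minimal sketch; the result is identical to Out[60]):

df.query('b not in @my_list')   # same rows as Out[60] above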
MaxU - stand with Ukraine

You could also use np.in1d. From https://stackoverflow.com/a/38083418/2336654:

For your use case:

df[~np.in1d(df.b, my_list)]

Demonstration

import numpy as np
import pandas as pd
from string import ascii_lowercase, ascii_uppercase

df = pd.DataFrame({'lower': list(ascii_lowercase), 'upper': list(ascii_uppercase)}).head(6)
exclude = list(ascii_uppercase[:6:2])

print(df)

  lower upper
0     a     A
1     b     B
2     c     C
3     d     D
4     e     E
5     f     F

print(exclude)

['A', 'C', 'E']

print(df[~np.in1d(df.upper, exclude)])

  lower upper
1     b     B
3     d     D
5     f     F
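
Note that in newer NumPy releases (1.13 and later) np.isin is the documented successor to np.in1d; assuming such a version, the same filter could be written as below (my sketch, not part of the original answer or its timings):

# np.isin is the element-wise replacement for np.in1d
print(df[~np.isin(df.upper, exclude)])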

Timing

Various methods: [timing plot]

With 1.3 million rows, excluding 3 items: [timing plot]

With 1.3 million rows, excluding 12 items: [timing plot]

Conclusion

Clearly, isin and query scale better in this case.
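
The plots above come from the answerer's own benchmark run; if you want to reproduce a comparison like this outside IPython, the standard-library timeit module is one option. A rough sketch, reusing the small demo frame from above and tiling it up (the row count and repeat count here are my own choices, not from the original benchmark):

import timeit

import numpy as np
import pandas as pd

# tile the 6-row demo frame up to roughly 1.3 million rows
big = pd.concat([df] * 220_000, ignore_index=True)
exclude = ['A', 'C', 'E']

stmts = [
    "big[~big.upper.isin(exclude)]",
    "big[~np.in1d(big.upper, exclude)]",
    "big.query('upper not in @exclude')",
]
for stmt in stmts:
    secs = timeit.timeit(stmt, number=3, globals=globals()) / 3
    print('%-40s %.3f s per run' % (stmt, secs))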

piRSquared
  • Pandas uses `np.in1d` [under the hood](https://github.com/pydata/pandas/blob/05e734ab171be0fda838c6b12839c38fa588da2c/pandas/core/algorithms.py#L123). It looks like they optimized it to use `np.in1d` only when it will generally be beneficial, which is on large datasets. It's slightly faster on your small example dataset, but when I tried it on similar data with 6k rows, excluding 6 letters, `isin` was nearly twice as fast. – root Jun 29 '16 at 20:38
  • Also, I noticed that `ix` slows it down a bit. I'll rerun over a large dataset and exclude `ix`. – piRSquared Jun 29 '16 at 20:39
  • Yeah, it's still worthwhile to mention `np.in1d` regardless though, as it's often useful in other contexts too. – root Jun 29 '16 at 20:41
  • I just made [another comparison for different dtypes](http://stackoverflow.com/a/38110564/5741205) - `np.in1d()` is __very__ slow for strings, but it wins for all numeric searches... – MaxU - stand with Ukraine Jun 29 '16 at 22:15

I decided to add another answer with timings for different methods and different dtypes, as it would be too long to fit into a single answer...

Timings (in milliseconds) against a 1M-row DataFrame for the following dtypes: int32, int64, float64, object (string):


In [207]: result
Out[207]:
                              int32  int64  float  string
method
df[~np.in1d(df.col, excl)]      249    271    307    2420
df[~df.col.isin(excl)]          414    210    514     189
df.ix[~df.col.isin(excl)]       399    212    553     189
df.query('@excl not in col')    415    228    563     206

In [208]: result.T
Out[208]:
method  df[~np.in1d(df.col, excl)]  df[~df.col.isin(excl)]  df.ix[~df.col.isin(excl)]  df.query('@excl not in col')
int32                          249                     414                        399                           415
int64                          271                     210                        212                           228
float                          307                     514                        553                           563
string                        2420                     189                        189                           206

Raw results:

int32:

In [159]: %timeit df[~np.in1d(df.int32, exclude_int32)]
1 loop, best of 3: 249 ms per loop

In [160]: %timeit df[~df.int32.isin(exclude_int32)]
1 loop, best of 3: 414 ms per loop

In [161]: %timeit df.ix[~df.int32.isin(exclude_int32)]
1 loop, best of 3: 399 ms per loop

In [162]: %timeit df.query('@exclude_int32 not in int32')
1 loop, best of 3: 415 ms per loop

int64:

In [163]: %timeit df[~np.in1d(df.int64, exclude_int64)]
1 loop, best of 3: 271 ms per loop

In [164]: %timeit df[~df.int64.isin(exclude_int64)]
1 loop, best of 3: 210 ms per loop

In [165]: %timeit df.ix[~df.int64.isin(exclude_int64)]
1 loop, best of 3: 212 ms per loop

In [166]: %timeit df.query('@exclude_int64 not in int64')
1 loop, best of 3: 228 ms per loop

float64:

In [167]: %timeit df[~np.in1d(df.float, exclude_float)]
1 loop, best of 3: 307 ms per loop

In [168]: %timeit df[~df.float.isin(exclude_float)]
1 loop, best of 3: 514 ms per loop

In [169]: %timeit df.ix[~df.float.isin(exclude_float)]
1 loop, best of 3: 553 ms per loop

In [170]: %timeit df.query('@exclude_float not in float')
1 loop, best of 3: 563 ms per loop

object / string:

In [171]: %timeit df[~np.in1d(df.string, exclude_str)]
1 loop, best of 3: 2.42 s per loop

In [172]: %timeit df[~df.string.isin(exclude_str)]
10 loops, best of 3: 189 ms per loop

In [173]: %timeit df.ix[~df.string.isin(exclude_str)]
10 loops, best of 3: 189 ms per loop

In [174]: %timeit df.query('@exclude_str not in string')
1 loop, best of 3: 206 ms per loop

Conclusion:

np.in1d() wins for int32 and float64 searches, but it's roughly 10 times slower than the alternatives when searching strings, so don't use it for object (string) or int64 dtypes!
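
If you want to bake that rule of thumb into a helper, something like the following could work (the function name and the dtype split are my own, derived from the timings above, not from the answer itself):

import numpy as np

def exclude_rows(df, col, values):
    """Return df without the rows whose `col` value appears in `values`."""
    # np.in1d won the int32/float64 timings; Series.isin won for int64 and strings
    if df[col].dtype in (np.dtype('int32'), np.dtype('float64')):
        mask = ~np.in1d(df[col].values, values)
    else:
        mask = ~df[col].isin(values)
    return df[mask]

# e.g.: exclude_rows(df, 'int32', exclude_int32)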

Setup:

import string
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'int32':    np.random.randint(0, 10**6, 10),
    'int64':    np.random.randint(10**7, 10**9, 10).astype(np.int64)*10,
    'float':    np.random.rand(10),
    'string':   np.random.choice([c*10 for c in string.ascii_uppercase], 10),
    })


df = pd.concat([df] * 10**5, ignore_index=True)

exclude_str = np.random.choice([c*10 for c in string.ascii_uppercase], 100).tolist()
exclude_int32 = np.random.randint(0, 10**6, 100).tolist()
exclude_int64 = (np.random.randint(10**7, 10**9, 100).astype(np.int64)*10).tolist()
exclude_float = np.random.rand(100)


In [146]: df.shape
Out[146]: (1000000, 4)

In [147]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
float     1000000 non-null float64
int32     1000000 non-null int32
int64     1000000 non-null int64
string    1000000 non-null object
dtypes: float64(1), int32(1), int64(1), object(1)
memory usage: 26.7+ MB

In [148]: df.head()
Out[148]:
      float   int32       int64      string
0  0.221662  283447  6849265910  NNNNNNNNNN
1  0.276834  455464  8785039710  AAAAAAAAAA
2  0.517846  618887  8653293710  YYYYYYYYYY
3  0.318897  363191  2223601320  PPPPPPPPPP
4  0.323926  777875  5357201380  QQQQQQQQQQ
MaxU - stand with Ukraine