1

I have a pandas dataframe with string values and I would like to be able to return a subset of the dataframe where the values contain some substring. This is easy to do on a series in this way (example adapted from pandas documentation):

import pandas as pd
import numpy as np
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4[s4.str.contains('A', na=False)]
0       A
3    Aaba
6    CABA
dtype: object

I would expect searching for substrings in all the columns of a dataframe to work the same as for series, but there are no .str methods for dataframes. You can filter a dataframe for an exact match of a string like this:

df = pd.concat((s4, s4.shift(1)), axis=1)
      0     1
0     A   NaN
1     B     A
2     C     B
3  Aaba     C
4  Baca  Aaba
5   NaN  Baca
6  CABA   NaN
7   dog  CABA
8   cat   dog

filtered_df = df[df == 'Baca']
      0     1
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   NaN   NaN
4  Baca   NaN
5   NaN  Baca
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN

result = df[df contains 'ac']

I would hope that would return the same thing as filtered_df = df[df == 'Baca'] but it's invalid syntax. I tried using df.apply to apply the series.str.contains method to each series of the dataframe. If that is a viable solution I wasn't able to work it out. I'm using python 3.5, and pandas 0.18 on Linux/Ubuntu.

Alex
  • 2,154
  • 3
  • 26
  • 49
  • 1
    Related: http://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter – Jon Clements Jun 17 '16 at 10:33

1 Answers1

3

Another solution is apply contains:

mask = df.apply(lambda x: x.str.contains('A', na=False))
print (mask)

       0      1
0   True  False
1  False   True
2  False  False
3   True  False
4  False   True
5  False  False
6   True  False
7  False   True
8  False  False

print (df[mask])
      0     1
0     A   NaN
1   NaN     A
2   NaN   NaN
3  Aaba   NaN
4   NaN  Aaba
5   NaN   NaN
6  CABA   NaN
7   NaN  CABA
8   NaN   NaN

If need check at least one True use any:

mask = df.apply(lambda x: x.str.contains('ac', na=False))
print (mask)
       0      1
0  False  False
1  False  False
2  False  False
3  False  False
4   True  False
5  False   True
6  False  False
7  False  False
8  False  False

print (mask.any(1))
0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
dtype: bool
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252