
I know there is a relevant thread about searching for a string in one column (here), but how does one use pd.Series.str.contains(pattern) across all columns?

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [u'aball', u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})


In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
Cibic

2 Answers


Use select_dtypes to keep only the object columns (i.e. the string columns), then applymap with the in operator:

df = pd.DataFrame({'vals': [1, 2, 3, 4], 
                   'ids': [None, u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})
print (df)
   vals    ids    id2
0     1   None  uball
1     2  bball  mball
2     3   cnut   pnut
3     4  fball  zball

mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in x if pd.notnull(x) else False)
#if there are never any NaNs or Nones:
#mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in x)
print (mask)
     ids    id2
0  False   True
1   True   True
2  False  False
3   True   True

Another solution is to use apply with str.contains:

mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball', na=False))
#if there are never any NaNs or Nones:
#mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball'))
print (mask)
     ids    id2
0  False   True
1   True   True
2  False  False
3   True   True

Then, for filtering, use DataFrame.any to keep rows with at least one True, or DataFrame.all to keep only rows where every value is True:

df1 = df[mask.any(axis=1)]
print (df1)
   vals    ids    id2
0     1   None  uball
1     2  bball  mball
3     4  fball  zball

df2 = df[mask.all(axis=1)]
print (df2)
   vals    ids    id2
1     2  bball  mball
3     4  fball  zball
jezrael
  • @piRSquared - Yes, but then problem use `all` – jezrael Jun 26 '18 at 14:32
  • @piRSquared - Change solutions for working with NaNs, Nones :) – jezrael Jun 26 '18 at 15:39
  • Is there something more memory friendly? This uses all of my 64gb of ram (1.9 million records/264 variables) – Cibic Jun 26 '18 at 16:02
  • @Cibic - If possible, use `regex=False`, check [this](https://stackoverflow.com/a/37894064) – jezrael Jun 26 '18 at 16:04
  • Also, this comes back with this error on latest pandas: "TypeError: ("argument of type 'int' is not iterable", 'occurred at index Code')" – Cibic Jun 26 '18 at 16:05
  • I need the regex to make it case insensitive you see – Cibic Jun 26 '18 at 16:06
  • @Cibic - hmm, some idea `mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in str(x).lower() if pd.notnull(x) else False)` or `mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball', na=False, regex=False, case=False))` – jezrael Jun 26 '18 at 16:11
  • For `case insensitive` is not necessary `regex=True`, just tested – jezrael Jun 26 '18 at 16:13
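
Pulling the comment suggestions together, a minimal runnable sketch of the case-insensitive, non-regex variant could look like this (it reuses the example df with a None from the answer above, and 'ball' stands in for whatever pattern you are searching for):

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [None, u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})

#build a boolean mask over the string (object) columns only:
#regex=False skips the regex engine, case=False makes the match case-insensitive,
#na=False turns None/NaN cells into False
mask = df.select_dtypes(include=[object]).apply(
           lambda col: col.str.contains('ball', na=False, case=False, regex=False))

#keep rows where at least one string column matched
print (df[mask.any(axis=1)])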

stack

If you select just the columns that might contain 'ball', namely those of dtype object, you can stack the resulting dataframe into a series. At that point you can call pandas.Series.str.contains and unstack the result back into a dataframe.

df.select_dtypes(include=[object]).stack().str.contains('ball').unstack()

     ids    id2
0   True   True
1   True   True
2  False  False
3   True   True
piRSquared
  • Uses too much ram if I use 1.9 million records and 264 variables – Cibic Jun 26 '18 at 16:01
  • Got it working with this: df.loc[df.select_dtypes(include='O').stack().str.contains('text', na=False, regex=False, case=False).unstack().any(1)] – Cibic Jun 27 '18 at 20:19
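
For reference, a sketch of the one-liner from the comment above, rewritten against the question's example df ('ball' is substituted here for 'text', and the axis is passed as a keyword, which recent pandas versions expect):

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [u'aball', u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})

#stack the object columns into one Series, test it once, then unstack back to a mask
hits = (df.select_dtypes(include='O')
          .stack()
          .str.contains('ball', na=False, regex=False, case=False)
          .unstack())

#keep rows where at least one string column matched
print (df.loc[hits.any(axis=1)])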