
I know there is a relevant thread about searching for a string in one column (here), but how does one use pd.Series.str.contains(pattern) across all columns?

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [u'aball', u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})


In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
Cibic

2 Answers


Use select_dtypes to keep only the object columns (i.e. the string columns), then applymap with the in operator:

df = pd.DataFrame({'vals': [1, 2, 3, 4], 
                   'ids': [None, u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})
print (df)
   vals    ids    id2
0     1   None  uball
1     2  bball  mball
2     3   cnut   pnut
3     4  fball  zball

mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in x if pd.notnull(x) else False)
#if there are never any NaNs or Nones:
#mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in x)
print (mask)
     ids    id2
0  False   True
1   True   True
2  False  False
3   True   True

Another solution is to use apply with str.contains:

mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball', na=False))
#if there are never any NaNs or Nones:
#mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball'))
print (mask)
     ids    id2
0  False   True
1   True   True
2  False  False
3   True   True

Then, for filtering, use DataFrame.any to keep rows with at least one True, or DataFrame.all to keep only rows where every value is True:

df1 = df[mask.any(axis=1)]
print (df1)
   vals    ids    id2
0     1   None  uball
1     2  bball  mball
3     4  fball  zball

df2 = df[mask.all(axis=1)]
print (df2)
   vals    ids    id2
1     2  bball  mball
3     4  fball  zball
jezrael
  • @piRSquared - Yes, but then problem use `all` – jezrael Jun 26 '18 at 14:32
  • @piRSquared - Change solutions for working with NaNs, Nones :) – jezrael Jun 26 '18 at 15:39
  • Is there something more memory friendly? This uses all of my 64gb of ram (1.9 million records/264 variables) – Cibic Jun 26 '18 at 16:02
  • @Cibic - If possible, use `regex=False`, check [this](https://stackoverflow.com/a/37894064) – jezrael Jun 26 '18 at 16:04
  • Also, this comes back with this error on latest pandas: "TypeError: ("argument of type 'int' is not iterable", 'occurred at index Code')" – Cibic Jun 26 '18 at 16:05
  • I need the regex to make it case insensitive you see – Cibic Jun 26 '18 at 16:06
  • @Cibic - hmm, some idea `mask = df.select_dtypes(include=[object]).applymap(lambda x: 'ball' in str(x).lower() if pd.notnull(x) else False)` or `mask = df.select_dtypes(include=[object]).apply(lambda x: x.str.contains('ball', na=False, regex=False, case=False))` – jezrael Jun 26 '18 at 16:11
  • For `case insensitive` is not necessary `regex=True`, just tested – jezrael Jun 26 '18 at 16:13
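
Pulling the comment suggestions together, a minimal runnable sketch of the case-insensitive, non-regex variant could look like this (it reuses the example df with a None from the answer above, and 'ball' stands in for whatever pattern you are searching for):

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [None, u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})

#build a boolean mask over the string (object) columns only:
#regex=False skips the regex engine, case=False makes the match case-insensitive,
#na=False turns None/NaN cells into False
mask = df.select_dtypes(include=[object]).apply(
           lambda col: col.str.contains('ball', na=False, case=False, regex=False))

#keep rows where at least one string column matched
print (df[mask.any(axis=1)])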

stack

If you select just the columns that might contain 'ball', namely those of dtype object, you can stack the resulting dataframe into a series. At that point you can call pandas.Series.str.contains and unstack the result back into a dataframe.

df.select_dtypes(include=[object]).stack().str.contains('ball').unstack()

     ids    id2
0   True   True
1   True   True
2  False  False
3   True   True
piRSquared
  • Uses too much ram if I use 1.9 million records and 264 variables – Cibic Jun 26 '18 at 16:01
  • Got it working with this: df.loc[df.select_dtypes(include='O').stack().str.contains('text', na=False, regex=False, case=False).unstack().any(1)] – Cibic Jun 27 '18 at 20:19
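
For reference, a sketch of the one-liner from the comment above, rewritten against the question's example df ('ball' is substituted here for 'text', and the axis is passed as a keyword, which recent pandas versions expect):

import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': [u'aball', u'bball', u'cnut', u'fball'],
                   'id2': [u'uball', u'mball', u'pnut', u'zball']})

#stack the object columns into one Series, test it once, then unstack back to a mask
hits = (df.select_dtypes(include='O')
          .stack()
          .str.contains('ball', na=False, regex=False, case=False)
          .unstack())

#keep rows where at least one string column matched
print (df.loc[hits.any(axis=1)])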