I have a pandas dataframe with string values and I would like to be able to return a subset of the dataframe where the values contain some substring. This is easy to do on a series in this way (example adapted from pandas documentation):
import pandas as pd
import numpy as np
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4[s4.str.contains('A', na=False)]
0 A
3 Aaba
6 CABA
dtype: object
I would expect searching for substrings in all the columns of a dataframe to work the same as for series, but there are no .str methods for dataframes. You can filter a dataframe for an exact match of a string like this:
df = pd.concat((s4, s4.shift(1)), axis=1)
0 1
0 A NaN
1 B A
2 C B
3 Aaba C
4 Baca Aaba
5 NaN Baca
6 CABA NaN
7 dog CABA
8 cat dog
filtered_df = df[df == 'Baca']
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 Baca NaN
5 NaN Baca
6 NaN NaN
7 NaN NaN
8 NaN NaN
result = df[df contains 'ac']
I would hope that would return the same thing as filtered_df = df[df == 'Baca']
but it's invalid syntax. I tried using df.apply to apply the series.str.contains method to each series of the dataframe. If that is a viable solution I wasn't able to work it out. I'm using python 3.5, and pandas 0.18 on Linux/Ubuntu.