285

Assume we have a data frame in Python Pandas that looks like this:

df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})

Or, in table form:

ids    vals
aball   1
bball   2
cnut    3
fball   4

How do I filter rows which contain the key word "ball?" For example, the output should be:

ids    vals
aball   1
bball   2
fball   4
Cleb
  • 25,102
  • 20
  • 116
  • 151
John Knight
  • 3,083
  • 2
  • 14
  • 9

4 Answers4

535
In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
Amit
  • 19,780
  • 6
  • 46
  • 54
  • 25
    How would you invert this to find all the rows that did not contain the string? – user4896331 Mar 01 '17 at 09:50
  • 97
    @user4896331 - `df[~df['ids'].str.contains("ball")]`, `~` negates the condition – Amit Mar 01 '17 at 12:57
  • If it was a specific word, to negate, could you also use: df = df[df.id != "ball"] – Brian Apr 26 '17 at 17:59
  • @Brian - Yes, in the above df you can try `df = df[df.ids != "aball"]` to see it in action. – Amit Apr 27 '17 at 02:03
  • @Amit: I need to access columns by id instead of name. However trying str gives me an error [AttributeError: 'DataFrame' object has no attribute 'str'] Does new pandas not support it or is it because of number based access? – Sameer Mahajan Oct 23 '17 at 10:01
  • never mind. I was using [[]]. After using [] it works. – Sameer Mahajan Oct 23 '17 at 10:16
  • @Amit, what if i have a list of strings that partially matches with column values. I want to extract all those rows. For example, rows partially matchin "ball", and rows partially matching "bat" etc? – Balki Mar 12 '19 at 16:34
  • To search for dot `.` don't forget to escape it `df[df['ids'].str.contains("\.")]` – Brambor Oct 21 '19 at 15:23
  • Guys, is it possible to write this with query? – Phoenix Jul 05 '22 at 12:28
  • I'm using `str.contains("'ball | bat | cat | man")` to filter using multiple keywords., which works fine. Is it possible to print which row matched which keyword? – vine_J Sep 02 '22 at 13:57
169
df[df['ids'].str.contains('ball', na = False)] # valid for (at least) pandas version 0.17.1

Step-by-step explanation (from inner to outer):

  • df['ids'] selects the ids column of the data frame (technically, the object df['ids'] is of type pandas.Series)
  • df['ids'].str allows us to apply vectorized string methods (e.g., lower, contains) to the Series
  • df['ids'].str.contains('ball') checks each element of the Series as to whether the element value has the string 'ball' as a substring. The result is a Series of Booleans indicating True or False about the existence of a 'ball' substring.
  • df[df['ids'].str.contains('ball')] applies the Boolean 'mask' to the dataframe and returns a view containing appropriate records.
  • na = False removes NA / NaN values from consideration; otherwise a ValueError may be returned.
Jubbles
  • 4,450
  • 8
  • 35
  • 47
  • Could you explain what that code is doing and how it works, please? – Kevin Jan 16 '15 at 01:36
  • How to work with partial match and grab the remaining string with the partial match `'#':str` something like this? – Sitz Blogz Jun 28 '16 at 01:50
  • 6
    Absolutely love it when someone does a step-by-step explanation. It really helps with comprehension! – user4896331 Mar 01 '17 at 09:49
  • 4
    And if you substitute `'ball'` with `'|'.join(list_of_balls)` one can apply a list of strings to the search. While the `'|'.join(list_of_balls)` creates a RegEx with OR to search for vaild strings – venti Sep 01 '17 at 08:38
  • Thanks a lot. BTW, would you care to explain how the code shall change if I want to filter the rows with the exact value that I want instead of containing it? Thanks – Bowen Liu Sep 13 '18 at 18:23
  • 5
    Note that you can make the filter case insensitive adding `case=False`, reuslting in `df[df['ids'].str.contains('ball', case=False,na = False)]` – Antonio Jun 26 '20 at 08:43
  • @Antonio thanks for that, `df[df['ids'].str.lower().contains('ball')]` throws an error – Tim Stack Jan 21 '21 at 11:04
  • @TimStack, The reference is in the following link. Please, let me know what error do you get? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html – Antonio Jan 28 '21 at 19:56
  • `lower()` throws the error (`AttributeError: 'Series' object has no attribute 'contains'`). This does not matter however as you pointed out that `case=False` can be used instead – Tim Stack Jan 29 '21 at 07:18
  • While for the specific case of ignoring case it's better to use the built-in way to do it through `case=False`, not that in general you _can_ chain operations like this, but you need to call the `str` attribute every time you do: `df.ids.str.lower().str.contains('ball')` – Puff Apr 27 '22 at 23:26
35
>>> mask = df['ids'].str.contains('ball')    
>>> mask
0     True
1     True
2    False
3     True
Name: ids, dtype: bool

>>> df[mask]
     ids  vals
0  aball     1
1  bball     2
3  fball     4
user3820991
  • 2,310
  • 5
  • 23
  • 32
19

If you want to set the column you filter on as a new index, you could also consider to use .filter; if you want to keep it as a separate column then str.contains is the way to go.

Let's say you have

df = pd.DataFrame({'vals': [1, 2, 3, 4, 5], 'ids': [u'aball', u'bball', u'cnut', u'fball', 'ballxyz']})

       ids  vals
0    aball     1
1    bball     2
2     cnut     3
3    fball     4
4  ballxyz     5

and your plan is to filter all rows in which ids contains ball AND set ids as new index, you can do

df.set_index('ids').filter(like='ball', axis=0)

which gives

         vals
ids          
aball       1
bball       2
fball       4
ballxyz     5

But filter also allows you to pass a regex, so you could also filter only those rows where the column entry ends with ball. In this case you use

df.set_index('ids').filter(regex='ball$', axis=0)

       vals
ids        
aball     1
bball     2
fball     4

Note that now the entry with ballxyz is not included as it starts with ball and does not end with it.

If you want to get all entries that start with ball you can simple use

df.set_index('ids').filter(regex='^ball', axis=0)

yielding

         vals
ids          
ballxyz     5

The same works with columns; all you then need to change is the axis=0 part. If you filter based on columns, it would be axis=1.

Cleb
  • 25,102
  • 20
  • 116
  • 151
  • Just saw your post as I was looking for a similar solution. I used the filter method you mentioned but failed to get desired results. The question is [here](https://stackoverflow.com/questions/52321121/pandas-failed-to-filter-rows-containing-a-specific-value-in-the-index-column-af/52321165?noredirect=1#comment91590757_52321165) Would you mind taking a look nad see if there's anything wrong with my code/method? IMO it shall be super simple albeit my being unable to make it work. – Bowen Liu Sep 13 '18 at 21:05
  • @BowenLiu: I added an answer there, please check whether that works for you. – Cleb Sep 13 '18 at 21:36
  • Just got on my laptop for the night. Will do that real soon, thank you so much. – Bowen Liu Sep 14 '18 at 01:20
  • Note that contains also works with regex, the wording in this answer is a tiny bit misleading in that regard. – Puff Apr 27 '22 at 23:30