
I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That's not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?

rypel
justinvf

  • If you're not wedded to regexes, `foo[foo.b.str.startswith("f")]` will work. – DSM Mar 10 '13 at 17:31
  • IMHO I think `foo[foo.b.str.match('(f.*)').str.len() > 0]` is a pretty good solution! More customizable and useful than startswith because it packs the versatility of regex in it. – tumultous_rooster Nov 10 '15 at 01:39
  • This might be a bit late, but in newer versions of pandas the problem is fixed: the line `foo[foo.b.str.match('f.*')]` works in pandas 0.24.2 for me. – Behzad Mehrtash Jul 06 '19 at 11:22

9 Answers


Use contains instead:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool
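Used as a boolean mask, that Series filters the frame directly (a short sketch reusing the question's example data):

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# str.contains('^f') returns a boolean Series: True where b starts with 'f'
mask = foo.b.str.contains('^f')
print(foo[mask])
#    a    b
# 1  2  foo
# 2  3  fat
```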
Dylan Pierce
waitingkuo

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

That is, I think, what you expected.

Alternatively you can use contains with regex option. For example:

foo[foo.b.str.contains('oo', regex=True, na=False)]

Result:

    a   b
1   2   foo

Passing na=False prevents errors in case there are NaN, null, etc. values.
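A small sketch of why na=False matters, using hypothetical data with a missing value:

```python
import pandas as pd

s = pd.Series(['foo', None, 'fat'])

# Without na=False, the missing value propagates as NaN in the result,
# and a mask containing NaN cannot be used for boolean indexing.
print(s.str.contains('^f'))

# With na=False, missing values simply become False.
mask = s.str.contains('^f', na=False)
print(mask.tolist())  # [True, False, True]
```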

Erkan Şirin

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the result for indexing, set the na=False argument (or True if you want to include NaNs in the result).
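A quick sketch of the difference between the three methods on hypothetical data (fullmatch requires pandas >= 1.1):

```python
import pandas as pd

s = pd.Series(['fat', 'fatter', 'a fat', None])

# match: the pattern must match at the START of the string
print(s.str.match('fat', na=False).tolist())      # [True, True, False, False]
# fullmatch: the pattern must match the ENTIRE string
print(s.str.fullmatch('fat', na=False).tolist())  # [True, False, False, False]
# contains: the pattern may match ANYWHERE in the string
print(s.str.contains('fat', na=False).tolist())   # [True, True, True, False]
```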

ankostis
Michael Siler

Multiple column search with dataframe:

frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]
m0nhawk

Building off of the great answer by user3136169, here is an example of how that might be done also removing NoneType values.

import re

def regex_filter(val):
    # treat None/NaN values as non-matching
    if val:
        return bool(re.search(regex, val))
    return False

df_filtered = df[df['col'].apply(regex_filter)]

You can also add regex as an arg:

def regex_filter(val, regex):
    ...

df_filtered = df[df['col'].apply(regex_filter, regex=myregex)]
sparrow
  • Thanks! Because of this I figured out a way to filter a column by an arbitrary predicate. – jman Dec 10 '19 at 01:40

Write a Boolean function that checks the regex, then use apply on the column:

foo[foo['b'].apply(regex_function)]
Jean-François Corbett
tzviya

Using Python's built-in ability to write lambda expressions, we could filter by an arbitrary regex operation as follows:

import re  

# with foo being our pd dataframe
foo[foo['b'].apply(lambda x: bool(re.search('^f', x)))]

By using re.search you can filter with complex regex-style queries, which in my opinion is more powerful (str.contains is rather limited).

Also important to mention: you want your string to start with a lowercase 'f'. With the regex f.*, re.search matches an f at an arbitrary location within your text. The ^ symbol explicitly anchors the match to the beginning of the content. So ^f is probably the better choice :)
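The anchoring point in plain re terms:

```python
import re

# re.search with 'f.*' finds an 'f' anywhere in the string...
print(bool(re.search('f.*', 'selfie')))  # True
# ...while '^f' only matches if the string starts with 'f'
print(bool(re.search('^f', 'selfie')))   # False
print(bool(re.search('^f', 'fat')))      # True
```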

Martin Bucher

Using a str slice:

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat
BENY

You can use query in combination with contains:

foo.query('b.str.contains("^f").values')

Alternatively you can also use startswith:

foo.query('b.str.startswith("f").values')

However, I prefer the first alternative since it allows you to search for multiple patterns using the | operator.
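For instance, one regex with | covers several prefixes at once. A sketch on the question's data, shown with str.contains directly as a mask:

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# '^f|^c' matches strings starting with either 'f' or 'c'
mask = foo.b.str.contains('^f|^c')
print(foo[mask])
#    a    b
# 1  2  foo
# 2  3  fat
# 3  4  cat
```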

rachwa