Select data using a regular expression

Question

I have a dataframe like this

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

I now want to select all columns of this dataframe where the entries in column a start with r followed by five numbers.

From here I learned how one would do this if it started just with r without the numbers:

print df.loc[df['a'].str.startswith('r'), :]

        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Something like this

print df.loc[df['a'].str.startswith(r'[r]\d{5}'), :]

does of course not work. How would one do this properly?

piRSquared · Accepted Answer · 2017-07-06T15:47:44.200

7

Option 1
pd.Series.str.match

df.a.str.match('^r\d{5}$')

1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool

Use it as a filter

df[df.a.str.match('^r\d{5}$')]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Option 2
Custom list comprehension using string methods

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]

[False, True, True, False, True, False]

Use it as a filter

df[[f(s) for s in df.a.values.tolist()]]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Timing

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match('^r\d{5}$')]
%timeit df[df.a.str.contains('^r\d{5}$')]

10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop

edited Jul 06 '17 at 15:47

answered Jul 06 '17 at 15:25

piRSquared

285,575
57
475
624

2

As `str.match` is using `re.match` the pattern could be changed to `'r\d{5}'` as by default it matches from the beginning of the string – EdChum Jul 06 '17 at 15:35
2

I find it depressing to see list comprehension outperforming the supposed vectorised pandas methods, useful to add the timings – EdChum Jul 06 '17 at 15:49
@EdChum Agreed! I know the devs have a lot to do. And I keep telling myself that when I can clear my schedule a bit, I'll start contributing to `pandas`... I still expect that I will at some point. – piRSquared Jul 06 '17 at 15:53

EdChum · Answer 2 · 2017-07-06T15:44:01.210

6

You can use str.contains and pass a regex pattern:

In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]

Out[112]: 
        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Here the pattern evaluates to ^r - start with character r, and then \d{5} looks for 5 digits

startswith looks for a character pattern, not a regex pattern which is why it fails

Regarding the difference between str.contains and str.match, they are analagous but str.contains uses re.search whilst str.match uses re.match which is more strict, see the docs.

edit

To answer your comment add $ so that it matches a specific number of characters, see related:

In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df

Out[117]: 
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5


In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]

Out[118]: 
        a  b
2  r00010  2
4  r01234  4

edited Jul 06 '17 at 15:44

answered Jul 06 '17 at 15:26

EdChum

376,765
198
813
562

Not part of the original post, but how would one now exclude the ones that have more than 5 numbers (or any other characters)? – Cleb Jul 06 '17 at 15:40
1

`r'^r\d{5}$` should handle this – EdChum Jul 06 '17 at 15:41
Great, thanks, unfortunately, I cannot upvote twice. I accept the other answer as he was slightly faster but thanks for the detailed explanations. – Cleb Jul 06 '17 at 15:44
1

Not an issue, glad I can help and add some explanations that are useful – EdChum Jul 06 '17 at 15:44
1

@Cleb **1** I don't mean to speak for EdChum, but I'm fairly confident that he'd share my sentiment. And that sentiment is, I know very well that his answer is every bit as good as mine and that you can only pick one... I'm ok with the outcome either way. **2** I've added timings to my post, I have **zero** expectations that `match` would work any faster that `contains`. My timings validate that. – piRSquared Jul 06 '17 at 15:51
@piRSquared rep isn't a big deal for me, attribution, quality of questions and answers is though. I think both answers contribute well to the question in different ways, I +1 your answer as soon as you posted it by the way – EdChum Jul 06 '17 at 15:54
@piRSquared: By "faster" I meant that you answered slightly before EdChum; I was not referring to the performance of your respective solutions. When there is more than one answer I like, I usually accept the one that appeared first (and upvote every solution that answers my question in a reasonable way). And thanks for the timings; indeed surprising that list comprehensions are faster than the pandas solutions... – Cleb Jul 06 '17 at 15:56
@Cleb I'd expect `match` to probably be slightly faster as by default it pattern matches from the beginning of the string so it's one less thing to worry about whilst `contains` is more general pattern matching anywhere in the string, you could say that `match` is semantically the correct thing to use here – EdChum Jul 06 '17 at 15:57
@EdChum I was moments from clicking "save edits" when I saw your answer pop up, Since it was `contains`, I deleted my edit and upvoted instead. *Oh* rep is a big deal for me at the moment... I want my mug and t-shirt (at least that is what I hope I get (-:) – piRSquared Jul 06 '17 at 16:01

Select data using a regular expression

2 Answers2