How to use regex non-capturing groups format in Python

Question

In the following code I want to get just the digits between '-' and 'u'. I thought I could use the regular-expression non-capturing group syntax (?: … ) to ignore everything from '-' up to the first digit, but the output always includes it. How can I use a non-capturing group to produce the correct output?

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract('((?:-[ ]*)[0-9]*)', expand=True)

This is explained very well in this SO [question](https://stackoverflow.com/questions/2703029/why-regular-expressions-non-capturing-group-is-not-working) — patrick, May 18 '18 at 18:42

score 5 · Accepted Answer · edited May 18 '18 at 18:49

It isn't saved by the inner group, but it is still part of the outer group. A non-capturing group doesn't mean its text is excluded from the match; it only means that the group is not saved as its own entry in the output. Its text is still captured as part of any enclosing capturing group.
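To make this concrete, here is a minimal sketch using the standard `re` module on the same strings as the question (`Series.str.extract` likewise takes the groups from the first match in each element). The variable names are illustrative only. The first pattern is the one from the question; the second moves the non-capturing group outside the capturing parentheses.

```python
import re

values = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# Pattern from the question: the non-capturing group is nested inside the
# outer capturing group, so the hyphen and spaces end up in group 1 anyway.
nested = re.compile(r'((?:-[ ]*)[0-9]*)')

# Same pieces, but the non-capturing group sits outside the capturing group.
outside = re.compile(r'(?:-[ ]*)([0-9]+)')

nested_result = [nested.search(v).group(1) for v in values]
outside_result = [outside.search(v).group(1) for v in values]

print(nested_result)   # ['-428', '- 68', '- 58', '- 318']
print(outside_result)  # ['428', '68', '58', '318']
```

Since the non-capturing group here is neither repeated nor alternated, it could be dropped entirely, which is what the pattern in this answer does.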

Simply keep the hyphen and the space outside the () that define the capturing group:

import pandas as pd

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (maybe followed by a space), a 'u' behind, and digits between the two.

Where,

-      # literal hyphen
\s?    # optional whitespace (the pattern above uses a literal space, " ?"); use \s* if you expect more than one
(\d+)  # capture one or more digits 
u      # literal "u"
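The breakdown above can also be kept inside the pattern itself with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is a sketch with the standard `re` module, using the `\s*` variant; the names `pattern` and `digits` are illustrative. Because verbose mode ignores literal spaces, `\s` is needed here rather than a plain space.

```python
import re

# Verbose pattern: whitespace and "#" comments inside it are ignored.
pattern = re.compile(r"""
    -       # literal hyphen
    \s*     # zero or more whitespace characters
    (\d+)   # capture one or more digits
    u       # literal "u"
""", re.VERBOSE)

values = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
digits = [pattern.search(v).group(1) for v in values]

print(digits)  # ['428', '68', '58', '318']
```

`Series.str.extract` has a `flags` parameter, so the same pattern string should work there with `flags=re.VERBOSE`.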

edited May 18 '18 at 18:49 by cs95

answered May 18 '18 at 18:41 by Patrick Artner

This raises `:1: DeprecationWarning: invalid escape sequence \d` with compiler warnings turned on. I suggest you use raw strings. – cs95 May 18 '18 at 18:45
@coldspeed very good suggestion - I just tested in pyfiddle and they do not show warnings. thx – Patrick Artner May 18 '18 at 18:47
Hmm, our patterns are ditto. I'll delete my answer ;-) – cs95 May 18 '18 at 18:48
@coldspeed not quite - I used a space, forgetting about \s all the time – Patrick Artner May 18 '18 at 18:52

sacuL · Answer 2 · 2018-05-18T18:51:02.677

3

I think you're trying too complicated a regex. What about:

df['b'].str.extract(r'-(.*)u', expand=True)

      0
0   428
1    68
2    58
3   318
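One thing to watch with `.*`: it also captures the whitespace after the hyphen, which the right-aligned display above hides, and it is greedy. The sketch below uses the standard `re` module to compare it with a stricter digit-only pattern. The names and the last input string (`'31u - 68u, 7u'`) are hypothetical, added only to show the greedy behaviour.

```python
import re

values = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

loose = re.compile(r'-(.*)u')        # pattern from this answer
strict = re.compile(r'-\s*(\d+)u')   # digits only, whitespace skipped

loose_result = [loose.search(v).group(1) for v in values]
strict_result = [strict.search(v).group(1) for v in values]

print(loose_result)   # ['428', ' 68', ' 58', ' 318']  (leading spaces kept)
print(strict_result)  # ['428', '68', '58', '318']

# Hypothetical input with a second "u": the greedy .* runs to the last one.
extra = '31u - 68u, 7u'
loose_extra = loose.search(extra).group(1)
strict_extra = strict.search(extra).group(1)

print(repr(loose_extra))   # ' 68u, 7'
print(repr(strict_extra))  # '68'
```

If the simpler pattern is preferred, stripping the result afterwards (for example with `.str.strip()` on the extracted column) removes the leading spaces for inputs like those in the question.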

edited May 18 '18 at 18:51

answered May 18 '18 at 18:40 by sacuL

This also returns a DeprecationWarning with compiler warnings enabled, because your string isn't a raw-string. – cs95 May 18 '18 at 18:46
Fair enough, am I right in saying that `r'-(.*)u'` would solve that? I'm not all that familiar with it TBH – sacuL May 18 '18 at 18:50
Indeed, it would. ;-) – cs95 May 18 '18 at 18:52
