5

In the following code I want to get just the digits between '-' and 'u'. I thought i could apply regular expression non capturing groups format (?: … ) to ignore everything from '-' to the first digit. But output always include it. How can i use noncapturing groups format to generate correct ouput?

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract('((?:-[ ]*)[0-9]*)', expand=True)

enter image description here enter image description here

StackUser
  • 142
  • 8
  • 1
    This is explained very well in this SO [question](https://stackoverflow.com/questions/2703029/why-regular-expressions-non-capturing-group-is-not-working) – patrick May 18 '18 at 18:42

2 Answers2

5

It isn't included in the inner group, but it's still included as part of the outer group. A non-capturing group does't necessarily imply it isn't captured at all... just that that group does not explicitly get saved in the output. It is still captured as part of any enclosing groups.

Just do not put them into the () that define the capturing:

import pandas as pd

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (mabye followed by a aspace), a 'u' behind and numbers between the both.

Where,

-      # literal hyphen
\s?    # optional space—or you could go with \s* if you expect more than one
(\d+)  # capture one or more digits 
u      # literal "u"
cs95
  • 379,657
  • 97
  • 704
  • 746
Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
3

I think you're trying too complicated a regex. What about:

df['b'].str.extract(r'-(.*)u', expand=True)

      0
0   428
1    68
2    58
3   318
sacuL
  • 49,704
  • 8
  • 81
  • 106