Python Regex negation forces character to be present

Question

I'm trying to create a regex that matches the numbers 1-12 for months of the year (where a leading zero is optional) and 1-31 for days of the month, without listing every number from 1 to 12 as a separate alternative. (Just imagine the memory needed if the range were 1 to 1 million.)

pd.Series(["some text8some text","some text13some text", "05"]).str.extract('(?P<mm>[1][012]|(?:[0])?[1-9])')

This works properly on the 8, but for 13 it matches the 1 instead of ignoring the number. So I tried:

pd.Series(["some text8some text","13some text", "05"]).str.extract('(?P<mm>[1][012]|(?:[0])?[1-9][^0-9])')

But this requires a character after the 8; otherwise it does not match.

Could someone please help with this regex negation which is forcing me to have a character after 8 to match?

The desired output for this is

0: 8
1: NaN
2: 5

Since there is no whitespace around the number, a word boundary will not work, which forces us to use regex negation.

Can you be more specific? Give us an input example and desired output? – Daniel Trugman Oct 26 '17 at 11:49
Yes, the desired output for 8 is 8 and the output for 13 is NaN. Thanks, I'll edit the question – dev Oct 26 '17 at 11:54
It is different from those questions though :| The original intent was to find an intuitive way to use regex negation, since the text need not be separated from the number by a word boundary or whitespace character. However, I hope the alternative solution will work fine on the dataset. Thanks @Jan – dev Oct 26 '17 at 12:12
@dev: Have a look here with lookarounds: https://regex101.com/r/kFnIsJ/1 – Jan Oct 26 '17 at 13:44
@Jan Thanks so much! regex101 is really helpful, and I kept digging and found `https://stackoverflow.com/questions/21300197/python-regex-to-find-whitespace-end-of-string-and-or-word-boundary`, which may be of use to someone. I'm new to lookarounds and discovered that the negative forms make for crisp regexes – dev Oct 26 '17 at 17:18

Jan · Accepted Answer · 2017-10-26T12:01:05.340


You need to use anchors or word boundaries:

\b(?:1[0-2]|[1-9])\b

See a demo on regex101.com.

With pandas this might be:

import pandas as pd

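# expand=False returns a Series named after the group instead of a one-column DataFrame
df = pd.Series(["8","13", "text in between 13 as well", "here is an 8 hidden"]).str.extract(r'(?P<mm>\b(?:1[0-2]|[1-9])\b)', expand=False)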
print(df)

This yields

0      8
1    NaN
2    NaN
3      8
Name: mm, dtype: object
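
If the numbers can sit directly against letters, as in "some text8some text" from the question, \b is not enough, because letters and digits are both word characters. A minimal sketch, assuming digit lookarounds in place of word boundaries (the pattern and the names MONTH, s and months are illustrative and not part of the original answer):

```python
import pandas as pd

# Assumed pattern: a month 1-12 that is not adjacent to any other digit.
# (?<!\d) and (?!\d) only look at the neighbours; they consume no characters.
# The optional leading zero sits outside the group, so "05" is captured as "5".
MONTH = r'(?<!\d)0?(?P<mm>1[0-2]|[1-9])(?!\d)'

s = pd.Series(["some text8some text", "some text13some text", "05", "here is an8hidden"])
months = s.str.extract(MONTH, expand=True)  # DataFrame with a single column "mm"
print(months)
# expected "mm" column: 8, NaN, 5, 8
```

Because the lookarounds are zero-width, the same pattern works whether the number is followed by a letter, whitespace, or the end of the string.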

edited Oct 26 '17 at 12:01

answered Oct 26 '17 at 11:54


No, I have data such that the numbers could be in between text – dev Oct 26 '17 at 11:56
@dev: Then use word boundaries, have updated the answer and the demo. – Jan Oct 26 '17 at 11:57
I'm sorry, after executing it does not match 8 – dev Oct 26 '17 at 11:59
@dev: Yes, it does, check the code. – Jan Oct 26 '17 at 12:01
It does not work with word boundaries. See this: `pd.Series(["8","13", "text in between13as well", "here is an8hidden"]).str.extract(r'(?P<mm>\b(?:1[0-2]|[1-9])\b)')` does not match the hidden 8. There is no whitespace to act as a word boundary. Thanks for the help though @Jan – dev Oct 26 '17 at 12:18
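
The behaviour described in the last comment can be reproduced with the standard `re` module alone. The sketch below is only an illustration (the variable names are made up for this example): it contrasts `\b`, a negated character class, and a negative lookahead.

```python
import re

text = "here is an8hidden"

# \b only matches between a word character (letter, digit, underscore)
# and a non-word character, so there is no boundary between "n" and "8".
with_boundary = re.search(r'\b(?:1[0-2]|[1-9])\b', text)

# [^0-9] must consume one character, so it fails when the number ends the string.
with_negated_class = re.search(r'(?:1[0-2]|0?[1-9])[^0-9]', "05")

# (?!\d) only asserts that no digit follows; it consumes nothing.
with_lookahead = re.search(r'(?:1[0-2]|0?[1-9])(?!\d)', "05")

print(with_boundary)           # None
print(with_negated_class)      # None
print(with_lookahead.group())  # 05
```

This is why the `[^0-9]` attempt in the question needs a trailing character, while a lookahead does not.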
