
I'm trying to create a regex that matches the numbers 1-12 for months of the year (where the leading zero is optional) and 1-31 for days of the month, without spelling out all the numbers from 1 to 12. (Just imagine the memory cost if this were for 1 to 1 million.)

pd.Series(["some text8some text","some text13some text", "05"]).str.extract('(?P<mm>[1][012]|(?:[0])?[1-9])') 

This works properly on the 8, but on 13, instead of ignoring it, it matches the 1. So I tried

pd.Series(["some text8some text","13some text", "05"]).str.extract('(?P<mm>[1][012]|(?:[0])?[1-9][^0-9])')

But this forces me to have a character after the 8; otherwise it does not match.

Could someone please help with this regex negation which is forcing me to have a character after 8 to match?

The desired output for this is

0: 8
1: NaN
2: 5

Since there is no whitespace around the numbers, a word boundary will not work, which forces us to use regex negation.

dev
  • Can you be more specific? Give us an input example and desired output? – Daniel Trugman Oct 26 '17 at 11:49
  • Yes, the desired output for 8 is 8 and the output for 13 is NaN. Thanks, I'll edit the question – dev Oct 26 '17 at 11:54
  • It is different from those questions though :| The original intent was to find an intuitive way to use regex negation, since the number need not be separated from the text by a word boundary or whitespace character. However, I hope the alternative solution will work fine on the dataset. Thanks @Jan – dev Oct 26 '17 at 12:12
  • There is no whitespace to be used as word boundary – dev Oct 26 '17 at 12:19
  • @dev: Have a look here with lookarounds: https://regex101.com/r/kFnIsJ/1 – Jan Oct 26 '17 at 13:44
  • @Jan Thanks so much! regex101 is really helpful, and I kept digging and found `https://stackoverflow.com/questions/21300197/python-regex-to-find-whitespace-end-of-string-and-or-word-boundary`, which may be of use to someone. I'm new to lookarounds and discovered the negative lookaround forms for writing crisp regexes – dev Oct 26 '17 at 17:18
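
The lookaround idea mentioned in the comments above can be sketched roughly as follows. The exact pattern behind the regex101 link is not reproduced in this thread, so the pattern, the `MONTH` name, and the `extract_month` helper are assumptions made for illustration. Unlike `[^0-9]`, the negative lookahead `(?!\d)` does not consume a character, so it also succeeds at the end of the string.

```python
import re

# Assumed pattern (not taken from the thread):
# (?<!\d)  the match must not be preceded by a digit
# 0?       optional leading zero, kept outside the named group
# (?!\d)   the match must not be followed by a digit (end of string is fine)
MONTH = re.compile(r'(?<!\d)0?(?P<mm>1[0-2]|[1-9])(?!\d)')

def extract_month(text):
    """Return the first month-like number in text, or None if there is none."""
    m = MONTH.search(text)
    return m.group('mm') if m else None

samples = ["some text8some text", "some text13some text", "05"]
results = [extract_month(t) for t in samples]
print(results)  # ['8', None, '5']
```

The same pattern string should also work with `Series.str.extract`, since pandas passes it to Python's `re` engine; there the `None` would show up as `NaN`.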

1 Answer


You need to use anchors or word boundaries:

\b(?:1[0-2]|[1-9])\b

See a demo on regex101.com.


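With pandas this might be:

import pandas as pd

# expand=False returns a Series named after the group (newer pandas versions default to a DataFrame)
mm = pd.Series(["8", "13", "text in between 13 as well", "here is an 8 hidden"]).str.extract(r'(?P<mm>\b(?:1[0-2]|[1-9])\b)', expand=False)
print(mm)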

This yields

0      8
1    NaN
2    NaN
3      8
Name: mm, dtype: object
Jan
  • No, I have data such that the numbers could be in between text – dev Oct 26 '17 at 11:56
  • @dev: Then use word boundaries, have updated the answer and the demo. – Jan Oct 26 '17 at 11:57
  • I'm sorry, after executing it does not match 8 – dev Oct 26 '17 at 11:59
  • @dev: Yes, it does, check the code. – Jan Oct 26 '17 at 12:01
  • It does not work with word boundaries. See this: `pd.Series(["8","13", "text in between13as well", "here is an8hidden"]).str.extract(r'(?P<mm>\b(?:1[0-2]|[1-9])\b)')` does not match the hidden 8. There is no whitespace to be used as word boundary. Thanks for the help though @Jan – dev Oct 26 '17 at 12:18
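
The behaviour described in the last comment follows from how `\b` is defined: it only matches between a word character (`\w`) and a non-word character, and letters and digits are both word characters. A minimal sketch using the answer's pattern with the standard `re` module (the variable names are illustrative only):

```python
import re

# The word-boundary pattern from the answer, without the named group.
word_boundary = re.compile(r'\b(?:1[0-2]|[1-9])\b')

# Spaces around the 8: there is a boundary on both sides, so it matches.
spaced = word_boundary.search("here is an 8 hidden")

# "n8" and "8h" are letter/digit pairs, both \w, so no boundary exists there.
glued = word_boundary.search("here is an8hidden")

print(spaced.group() if spaced else None)
print(glued)
```

The first search finds the 8, while the second returns `None`, which is why the number glued to the surrounding text is not matched.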