I am trying to locate registration numbers in some documents. The best tool for this seems to be Python's re module. I have a regular expression that works, but I cannot get it to work once I switch to a named group.
Here is the original text I am trying to extract from:
REGISTRATION NO. 874224207 PAGE 32
This regular expression works on Pythex:
\s+\(?\s*REGISTRATION\s+NUMBER\)?[\.:]?\)?\s+[A-Z0-9#]{9}\s+|\s+\(?\s*REGISTRATION\s+NO\)?[\.:]?\)?\s+[A-Z0-9#]{9}\s+
But when I name my capture group theregis (the registration number is all I want from the result), I do not get any match:
\s+\(?\s*REGISTRATION\s+NUMBER\)?[\.:]?\)?\s+(?P<theregis>[A-Z0-9#]{9})\s+|\s+\(?\s*REGISTRATION\s+NO\)?[\.:]?\)?\s+(?P=theregis)\s+
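A minimal sketch that reproduces the behaviour with both patterns, in case it helps. The leading newline in the sample string is an assumption (both patterns require whitespace before REGISTRATION, which the real document presumably has), and the variable names are only illustrative.

```python
import re

# Sample line from above; the leading newline is an assumption, since
# both patterns start with \s+ and need whitespace before REGISTRATION.
text = "\nREGISTRATION NO. 874224207 PAGE 32"

# The pattern without a named group (works).
unnamed = re.compile(
    r"\s+\(?\s*REGISTRATION\s+NUMBER\)?[\.:]?\)?\s+[A-Z0-9#]{9}\s+"
    r"|\s+\(?\s*REGISTRATION\s+NO\)?[\.:]?\)?\s+[A-Z0-9#]{9}\s+"
)

# The pattern with (?P<theregis>...) in the first alternative and
# (?P=theregis) in the second (no match).
named = re.compile(
    r"\s+\(?\s*REGISTRATION\s+NUMBER\)?[\.:]?\)?\s+(?P<theregis>[A-Z0-9#]{9})\s+"
    r"|\s+\(?\s*REGISTRATION\s+NO\)?[\.:]?\)?\s+(?P=theregis)\s+"
)

unnamed_match = unnamed.search(text)
named_match = named.search(text)

print(unnamed_match)  # a match object covering the whole registration phrase
print(named_match)    # None
```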
Per the docs
- My named group is in parens
- I begin my group with a ?P
- My group has a name that is enclosed with <>
When I use my named group
- The group is placed in ()
- I begin with a ? and then P=
- The group name matches the name I gave it
- There are no extraneous characters in the parens where I have used the group name
- I tried changing the group name to something else - no luck
Finally, I used this as my model:
p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
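For comparison, here is a small sketch of what the model pattern does on its own. As far as I understand it, (?P=word) matches the same text that the group named word has already captured, so the pattern finds doubled words. The sample strings are only illustrative.

```python
import re

# The model pattern: a word, whitespace, then the same word again.
p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')

# A string containing a doubled word matches.
doubled = p.search('Paris in the the spring')
print(doubled.group())        # the the
print(doubled.group('word'))  # the

# A string with no doubled word does not match.
different = p.search('Paris in the spring')
print(different)              # None
```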