REGEX: Parsing n digits with non numeric word boundaries

Question

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:

<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>

<...> stands for unimportant, irrelevant XML. Focus on CustomerId and OrderId.

My issue lies in parsing a string similar to the one above. I have a regexParse definition that works perfectly, but it is not intuitive. I need to match only the part of the string that contains 44444444.

My current setup is:

searchPattern = '>\d{8}</CustomerId'

Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if the non-numeric text at the word boundary after them matches CustomerId, return them.

Idea:

searchPattern = '\bd{16}\b'

My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you could either help me out with my issue or point me down the right path (toward a guide or something along those lines). Any help is appreciated.

Mods, if this is in the wrong area, apologies. I wanted to post this in the Python discussion because I am not sure whether Python regex supports this functionality.

Thanks again all,

darcmasta

Why are you parsing xml with regular expressions, as opposed to a proven XML parser? — Joe Day, Aug 15 '12 at 17:00
I feel a reference to [this question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) is all but mandatory. — Burhan Khalid, Aug 15 '12 at 19:02

score 0 · Accepted Answer · answered Aug 15 '12 at 17:03

txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""

import re
# Capture the tag name and the digits in each <tag>digits< pair.
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]

score 0 · Answer 2 · answered Aug 15 '12 at 17:06

You might consider using a lookbehind assertion in your regex to make it easier for a human to read:

>>> import re
>>> a = re.compile("(?<=OrderId>)\\d{6}")
>>> a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
>>> b = re.compile("(?<=CustomerId>)\\d{8}")
>>> b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
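
To match only the CustomerId digits, with the tag checked on both sides as the question asks, a lookbehind can be paired with a lookahead. This is a minimal sketch rather than a definitive solution: the sample string and the `<Other>` element are made up for illustration, and it assumes the ID is always exactly 8 digits.

```python
import re

# Illustrative sample modelled on the question; <Other> is a made-up element
# added to show that 8 digits under a different tag are ignored.
txt = (
    "<OrderId>123456</OrderId>"
    "<CustomerId>44444444</CustomerId>"
    "<Other>12345678</Other>"
)

# The lookbehind requires the opening tag before the digits and the lookahead
# requires the closing tag after them; neither is part of the match itself.
customer_id = re.compile(r"(?<=<CustomerId>)\d{8}(?=</CustomerId>)")

matches = customer_id.findall(txt)
print(matches)
```

Because the assertions consume no characters, `findall` returns just the digits and no capturing group is needed.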

DrSkippy

Well the trick is that I need to look for only CustomerId's not OrderId's. Plus CustomerId sometimes gets whacked up and reports as 44444444. – jvasallo Aug 15 '12 at 18:54

score 0 · Answer 3 · answered Aug 15 '12 at 18:58

You should be using raw string literals:

searchPattern = r'\b\d{16}\b'

The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
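
The difference is easy to check interactively. The sketch below uses an illustrative sample string and assumes an 8-digit ID (as in the question's data) rather than the 16 shown above:

```python
import re

plain = '\b'    # non-raw literal: a single backspace character (0x08)
raw = r'\b'     # raw literal: two characters, a backslash followed by 'b'

text = "<OrderId>123456</OrderId><CustomerId>44444444</CustomerId>"

# Pattern built with the non-raw '\b': re looks for literal backspace
# characters around the digits, which the text does not contain.
broken = re.findall(plain + r'\d{8}' + plain, text)

# Pattern built with the raw string: \b reaches re as a word-boundary assertion.
fixed = re.findall(r'\b\d{8}\b', text)

print(repr(plain), len(raw))
print(broken)
print(fixed)
```

Note that the word boundaries alone do not tie the match to CustomerId; they only ensure the 8 digits are not part of a longer run of word characters.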
