I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:

<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>

<...> stands for unimportant, irrelevant XML. Focus primarily on CustomerId and OrderId.

My issue lies in parsing a string like the one above. I have a regexParse definition that works perfectly. However, it is not intuitive. I need to match only the part of the string that contains 44444444.

My current setup is:

searchPattern = '>\d{8}</CustomerId'

Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if what follows the digits is a non-numeric word boundary and matches CustomerId, return them.

Idea:

searchPattern = '\bd{16}\b'

My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue or point me in the right direction (a guide or something along those lines). Any help is appreciated.

Mods, if this is in the wrong area, apologies. I posted this in the Python discussion because I am not sure whether Python regex supports this functionality.

Thanks again all,

darcmasta

jvasallo

3 Answers

import re

txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""

# Capture each tag name together with the digits it encloses.
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
Joran Beasley

You might consider using a lookbehind assertion in your regex to make it easier for a human to read:

import re
text = "<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>"
# (?<=...) matches only where the given text comes immediately before the digits.
a = re.compile(r"(?<=OrderId>)\d{6}")
a.findall(text)
# ['123456']
b = re.compile(r"(?<=CustomerId>)\d{8}")
b.findall(text)
# ['44444444']
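
To anchor the match on both sides of the digits, as the question asks, the lookbehind could be paired with a lookahead on the closing tag. This is only a sketch: the sample string and the name customer_id_pattern are made up for illustration, and it assumes the ID is always exactly eight digits directly between the tags.

```python
import re

# Hypothetical sample line modelled on the XML in the question.
sample = "<OrderId>123456</OrderId><CustomerId>44444444</CustomerId>"

# (?<=<CustomerId>) requires the opening tag immediately before the digits;
# (?=</CustomerId>) requires the closing tag immediately after them.
# Neither tag is included in the returned match.
customer_id_pattern = re.compile(r"(?<=<CustomerId>)\d{8}(?=</CustomerId>)")

customer_ids = customer_id_pattern.findall(sample)
print(customer_ids)
```

Because the tags are only asserted and not consumed, findall returns the bare digits, and the OrderId value is skipped.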
DrSkippy
  • Well the trick is that I need to look for only CustomerId's not OrderId's. Plus CustomerId sometimes gets whacked up and reports as 44444444. – jvasallo Aug 15 '12 at 18:54

You should be using raw string literals:

searchPattern = r'\b\d{16}\b'

The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
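
A quick way to see the difference is to compare the two literals directly. The sketch below uses a made-up test string and \d{8} rather than \d{16}, so that it lines up with the eight-digit ID in the question.

```python
import re

plain = '\b'   # one character: backspace (0x08)
raw = r'\b'    # two characters: a backslash followed by 'b'
print(len(plain), len(raw))

# Hypothetical test string containing an eight-digit ID.
text = "id 44444444 end"

# Raw string: the re module sees \b and treats it as a word boundary.
raw_matches = re.findall(r'\b\d{8}\b', text)

# Plain string: each '\b' is already a backspace character by the time
# re sees it, so the pattern looks for literal backspaces and finds none.
plain_matches = re.findall('\b\\d{8}\b', text)

print(raw_matches, plain_matches)
```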

MRAB