5

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python, and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex, and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF and the OCR step might introduce spaces between characters, I'd get to the value I want with something like this:

oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")

In Python's built-in re module, this isn't possible because the use of ? makes the lookbehind a variable-width expression rather than a fixed-width one. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language, I'd like to know the Pythonic way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:

oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", oAcctNum).group(1)

Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.

Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

Wiktor Stribiżew
tblznbits
  • The [`regex`](https://pypi.python.org/pypi/regex) module supports variable-width lookbehinds. See also http://stackoverflow.com/q/11640447/3001761 – jonrsharpe Jul 22 '15 at 13:16
  • @jonrsharpe Thank you for that, that's good to know! Looking at the answers below, though, I'm starting to second guess my reliance on lookarounds. But again, thanks for pointing me to the `regex` module. – tblznbits Jul 22 '15 at 13:19

3 Answers

3

Notice that if you can use groups, you generally do not need lookbehinds. So how about

match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)

In practice:

>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
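
If you have to do this for every line you parse, one way to keep it tidy is to compile the pattern once and wrap the lookup in a small helper. This is only a sketch: the names `ACCT_RE` and `extract_acct` and the sample lines are made up for illustration and are not from the question.

```python
import re

# Compile once and reuse the pattern for every line.
ACCT_RE = re.compile(r"ORIG\s?:\s?/\s?([A-Z0-9]+)")

def extract_acct(line):
    """Return the value captured after 'ORIG:/' or None if the line has none."""
    match = ACCT_RE.search(line)
    return match.group(1) if match else None

# Hypothetical sample input, with and without OCR-style spaces.
lines = ["ORIG:/AB123", "ORIG : / CD456", "no account here"]
results = [extract_acct(line) for line in lines]
print(results)  # ['AB123', 'CD456', None]
```

Returning None for lines without a match avoids the AttributeError you would get from calling .group(1) on a failed search.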
  • Thanks for the reply, Antii! You and stribizhev have the same idea and it seems like this is the best practice overall. Time to rewrite some code... – tblznbits Jul 22 '15 at 13:21
2

In the case you described, you need to use a capture group:

"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"

will become

r"ORIG\s?:\s?/\s?([A-Z0-9]+)"

The value will be in .group(1). Note that raw strings are preferred.

Here is a sample code:

import re
# Compile once; re.IGNORECASE lets [A-Z0-9] match lowercase letters too
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(p.search(test_str).group(1))  # prints: texthere


Unless you need overlapping matches, using a capturing group instead of a look-behind is rather straightforward.
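
To illustrate the overlapping-matches caveat, here is a small sketch with a made-up string (not from the question) in which the closing delimiter of one match is also the opening delimiter of the next. A fully consuming pattern skips the second value, while lookarounds find both; keeping only a trailing lookahead next to the capturing group also works.

```python
import re

# Hypothetical input: each digit sits between two 'x' delimiters,
# and neighbouring matches share a delimiter.
text = "x1x2x"

# The consuming pattern eats the shared 'x', so the second digit is missed.
consuming = re.findall(r"x(\d)x", text)

# Lookarounds consume nothing, so both digits are found.
lookaround = re.findall(r"(?<=x)\d(?=x)", text)

# Capturing group plus a lookahead for the trailing delimiter only.
mixed = re.findall(r"x(\d)(?=x)", text)

print(consuming)   # ['1']
print(lookaround)  # ['1', '2']
print(mixed)       # ['1', '2']
```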

Wiktor Stribiżew
  • You bring up a really good point. Looks like it's time to start rethinking my regex approach. – tblznbits Jul 22 '15 at 13:20
  • The point is actually this: if you do not need overlapping matches, use a capturing group. Look-behinds consume resources and can be costly in terms of performance. Often there is no big difference, but if a lookbehind is lengthy, the difference might be visible. – Wiktor Stribiżew Jul 22 '15 at 13:23
1
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))

You can use findall directly: when the pattern contains a single capturing group, it returns a list with that group's value for every match.
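
As a quick sketch of how the number of groups changes what findall returns (the sample text and variable names here are made up for illustration):

```python
import re

# Hypothetical input with two occurrences, one with OCR-style spaces.
text = "ORIG:/AB123 then ORIG : / CD456"

# One group: a list of that group's values.
one_group = re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", text)

# No group: a list of the whole matches.
no_group = re.findall(r"ORIG\s?:\s?/\s?[A-Z0-9]+", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(ORIG)\s?:\s?/\s?([A-Z0-9]+)", text)

print(one_group)   # ['AB123', 'CD456']
print(no_group)    # ['ORIG:/AB123', 'ORIG : / CD456']
print(two_groups)  # [('ORIG', 'AB123'), ('ORIG', 'CD456')]
```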

vks