3

So I've written this, which is horrific:

def parse_results(string):
    space = r"([\s\t]{0,5})"
    building_type = r"(([Uu]nit|[Ss]tudio|[Ff]lat)?)"
    street_type = (r"((\d+)(\&|\-)*(\d*)(\w*)(\w*)(\s*)(\w*)(\s*)(\w*)(\s*)"
                   r"([Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|"
                   r"[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|"
                   r"[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial"
                   r"[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen))")
    line_1 = r"(\w*)"
    line_2 = r"(\w*)"
    line_3 = r"(\w*)"
    line_4 = r"(\w*)"
    line_5 = r"(\w*)"
    postcode = r"(([A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}))"
    pattern = re.compile(rf"({building_type}{space}{street_type}{space}"
                         rf"{line_1}{space}{line_2}{space}{line_3}{space}"
                         rf"{line_4}{space}{line_5}{space}{postcode})")
    try:
        matches = pattern.finditer(string)
        for match in matches:
            address = re.sub(r"\s+", r" ", match.group(1))
        return address
    except Exception as e:
        return (f"Error looking for address, exception {e}")

Its purpose is to look for UK addresses in a large text corpus I am using for machine learning training. It is unusably slow however because of the backtracking .

After research, the solution appears to be to use atomic groupings, similar to how it is done in Ruby.

The Python RE module doesn't support this out of the box, however there are workarounds such as this:

Do Python regular expressions have an equivalent to Ruby's atomic grouping?

And apparently there is a Python Regex module that does support atomix groupings out of the box, but almost no one seems to be talking about it in tutorials.

Two questions:

  1. Which is the best approach? RE module work around or the Regex module?

  2. Can someone point me in the direction of examples so I can figure this out for my usecase?

Thank you!

Sandy Lee
  • 221
  • 1
  • 9
  • Whether you are using `re` or `regex`, you will have to fix your pattern, as it is [catastrophic backtracking prone](https://regex101.com/r/3EvF6O/1). Atomic groupings are not necessary here, you need optional groupings with obligatory patterns. Also, you need to fix your alternations that may start matching at the same location inside a string. – Wiktor Stribiżew Apr 20 '21 at 08:58
  • Thanks for that. So do you have an example of what you mean that I can work with to solve the complete problem? Interestingly, the page on regex101.com that you directed me to links to the same article as I have above that mentions atomic groupings and posssive quantifiers as a solution. What I am after is a handle of what that actually means for my specific use case. Thanks! – Sandy Lee Apr 20 '21 at 09:06
  • What about [this regex](https://regex101.com/r/OKGGn2/2)? See [this Python demo](https://ideone.com/W4j2Si). – Wiktor Stribiżew May 16 '21 at 09:57
  • Thank you for taking a look at this for me, I'll try this out and see how I get on. Thank you! – Sandy Lee May 17 '21 at 08:53
  • I had to tweak it a little bit, but got there in the end. I wasn't perfectly clear with my brief. Your advice took me in a different direction and I fixed it. Thank you! – Sandy Lee Jun 09 '21 at 16:17
  • Ok, I can only answer the question you asked, I hope it will work as intended all the way through. – Wiktor Stribiżew Jun 09 '21 at 17:14

1 Answers1

1

Whether you are using re or regex, you will have to fix your pattern, as it is catastrophic backtracking prone. Atomic groupings are not necessary here, you need optional groupings with obligatory patterns. Also, you need to fix your alternations that may start matching at the same location inside a string.

You can use

(?:[Uu]nit|[Ss]tudio|[Ff]lat)?\s{0,5}\d+[&-]*\w*\s*(?:[Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen)(?:\s{1,5}\w+){0,5}\s{0,5}[A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}

See the regex demo.

See the Python demo:

import re
 
def parse_results(string):
    space = r"\s{0,5}"
    building_type = r"(?:[Uu]nit|[Ss]tudio|[Ff]lat)?"
    street_type = (r"\d+[&-]*\w*\s*"
                   r"(?:[Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|"
                   r"[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|"
                   r"[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial"
                   r"[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen)")
    postcode = r"[A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}"
    pattern = re.compile(rf"{building_type}{space}{street_type}(?:\s{{1,5}}\w+){{0,5}}"
                         rf"{space}{postcode}")
    print(pattern.pattern)
    try:
        return [re.sub(r"\s+", r" ", x) for x in pattern.findall(string)]
    except Exception as e:
        return (f"Error looking for address, exception {e}")
 
print(parse_results('Unit 23 End AS4 0SS'))
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563