So I've written this, which is horrific:
import re

def parse_results(string):
    # Up to five whitespace characters between address parts.
    space = r"([\s\t]{0,5})"
    building_type = r"(([Uu]nit|[Ss]tudio|[Ff]lat)?)"
    # House number, up to three words, then a street-type keyword.
    street_type = (r"((\d+)(\&|\-)*(\d*)(\w*)(\w*)(\s*)(\w*)(\s*)(\w*)(\s*)"
                   r"([Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|"
                   r"[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|"
                   r"[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial "
                   r"[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen))")
    line_1 = r"(\w*)"
    line_2 = r"(\w*)"
    line_3 = r"(\w*)"
    line_4 = r"(\w*)"
    line_5 = r"(\w*)"
    postcode = r"(([A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}))"
    pattern = re.compile(rf"({building_type}{space}{street_type}{space}"
                         rf"{line_1}{space}{line_2}{space}{line_3}{space}"
                         rf"{line_4}{space}{line_5}{space}{postcode})")
    try:
        matches = pattern.finditer(string)
        for match in matches:
            # Collapse runs of whitespace and return the first address found.
            address = re.sub(r"\s+", r" ", match.group(1))
            return address
    except Exception as e:
        return (f"Error looking for address, exception {e}")
Its purpose is to find UK addresses in a large text corpus that I am using for machine learning training. However, it is unusably slow because of backtracking.
After some research, the solution appears to be atomic grouping, similar to what Ruby supports.
Python's re module doesn't support this out of the box; however, there are workarounds such as this one:
Do Python regular expressions have an equivalent to Ruby's atomic grouping?
Apparently there is also a third-party Python regex module that supports atomic grouping out of the box, but almost no tutorials seem to cover it.
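For reference, the workaround usually suggested for the re module is to wrap the sub-pattern in a capturing lookahead and then consume it with a backreference. A minimal sketch of that idea follows; the `atomic` helper and the group names `run` and `num` are hypothetical names made up for this illustration, and the version check assumes Python 3.11 or later, where re accepts `(?>...)` directly.

```python
import re
import sys


def atomic(inner: str, name: str) -> str:
    """Hypothetical helper: emulate the atomic group (?>inner) in plain re.

    The lookahead captures what `inner` would match, and the backreference
    then consumes exactly that text. A lookahead is not re-entered on
    backtracking, so the captured text is never given back.
    `name` must be a valid group name that is unique within the final pattern.
    """
    return rf"(?=(?P<{name}>{inner}))(?P={name})"


# Ordinary quantifier: a+ can give back one "a" so the trailing "ab" matches.
plain = re.compile(r"a+ab")

# Emulated atomic group: a+ keeps every "a", so "ab" can never follow it.
emulated = re.compile(atomic(r"a+", "run") + r"ab")

# A case where the atomic version still matches: digits, then a letter.
house_number = re.compile(atomic(r"\d+", "num") + r"[A-Z]")

print(plain.search("aaab"))                       # a match object
print(emulated.search("aaab"))                    # None
print(house_number.search("Flat 12B").group(0))   # 12B

# Assumption: on Python 3.11+ the native syntax behaves like the emulation.
if sys.version_info >= (3, 11):
    native = re.compile(r"(?>a+)ab")
    print(native.search("aaab"))                  # None
```

As far as I can tell, the third-party regex module accepts the same `(?>...)` syntax and mirrors the re API (`regex.compile`, `finditer`, and so on), so a pattern written for one should carry over to the other with little change.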
Two questions:
Which is the better approach: the re module workaround or the regex module?
Can someone point me towards examples so I can work this out for my use case?
Thank you!