I am trying to identify road names within a string for span tagging. There can be more than one road name within the string but often there is only one.

For most of the roads the format is something like

"flat 14, 24-34 barrington street, London"

"23 the honourable lord barrington's street, London"

"23 the honourable lord barrington's street, 42 the dishonarable baron lordington's street, London"

These are easily captured using a basic regex of the form (?<=\s)([a-z'\s])+(street) or ([a-z']+(\s)?)+(street)(?=,)

However, sometimes an address will have the form

"land to the south of barrington street, London"

"plot 12 on barrington street, London"

A few key words are almost always used in this situation: 'at', 'on', 'in' and 'adjoining'.

I would like to make a regex that matches multiple words followed by 'street' but does not match any of the key words or the words that come before them in the string. In other words, it should extract the street name but not "plot 12 on".
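
To make the problem concrete, here is a minimal sketch that applies the first basic pattern above to one of the awkward addresses. The variable names are only for illustration. The key word 'on' is swallowed into the match, which is what I want to avoid.

```python
import re

# The first basic pattern from above, unchanged
basic = re.compile(r"(?<=\s)([a-z'\s])+(street)")

m = basic.search("plot 12 on barrington street, London")

# group(0) is the whole match; it includes the unwanted key word
print(m.group(0))  # on barrington street
```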

I have attempted to use negative lookbehind but have not been successful in making it work. I have seen this answer but it doesn't seem appropriate for my use.

Jonno Bourne

1 Answer

You can use:

(?<!\S)(?:(?!\b(?:at|on|in|adjoining)\b)[^\n\d])*? street\b

The pattern matches:

  • (?<!\S) Assert a whitespace boundary to the left (start of the string or a whitespace character)
  • (?: Non capture group
    • (?!\b(?:at|on|in|adjoining)\b) Negative lookahead, assert that none of the key words starts at the current position
    • [^\n\d] Match any char except a digit or a newline
  • )*? Close the non capture group and repeat it zero or more times, as few times as possible
  • " street\b" Match a space and the literal word street, followed by a word boundary to prevent a partial match
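
The same breakdown can be written directly into the pattern with re.VERBOSE. This is only a sketch of an equivalent spelling: the name STREET_RE is made up for illustration, and the literal space before street is written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# STREET_RE is a hypothetical name; the pattern is the one above, spelled out
STREET_RE = re.compile(
    r"""
    (?<!\S)                               # whitespace boundary to the left
    (?:
        (?!\b(?:at|on|in|adjoining)\b)    # no key word may start here
        [^\n\d]                           # any char except a newline or a digit
    )*?                                   # repeat as few times as possible
    [ ]street\b                           # a space, then the word street
    """,
    re.VERBOSE,
)

print(STREET_RE.findall("plot 12 on barrington street, London"))
```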

See a Regex demo and a Python demo

Example code

import re

pattern = r"(?<!\S)(?:(?!\b(?:at|on|in|adjoining)\b)[^\n\d])*? street\b"

s = (
    "flat 14, 24-34 barrington street, London\n"
    "23 the honourable lord barrington's street, London\n"
    "23 the honourable lord barrington's street, 42 the dishonarable baron lordington's street, London\n"
    "land to the south of barrington street, London\n"
    "plot 12 on barrington street, London"
)

print(re.findall(pattern, s))

Output

[
'barrington street',
"the honourable lord barrington's street",
"the honourable lord barrington's street",
"the dishonarable baron lordington's street",
'land to the south of barrington street',
'barrington street'
]
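
Since the goal is span tagging, re.finditer may be more convenient than re.findall because each match object carries its character offsets. The sketch below rests on assumptions: build_street_pattern and street_spans are hypothetical helper names, and adding 'of' to the key words is only a guess at how "land to the south of" might be trimmed.

```python
import re

def build_street_pattern(stop_words):
    """Build the pattern above from a list of key words (hypothetical helper)."""
    alternation = "|".join(re.escape(w) for w in stop_words)
    return re.compile(
        rf"(?<!\S)(?:(?!\b(?:{alternation})\b)[^\n\d])*? street\b"
    )

def street_spans(text, pattern):
    """Return (start, end, matched_text) for every street name found."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

# Key words from the question
default_pattern = build_street_pattern(["at", "on", "in", "adjoining"])
# Assumption: 'of' added so that "land to the south of" is dropped
extended_pattern = build_street_pattern(["at", "on", "in", "adjoining", "of"])

plot_spans = street_spans("plot 12 on barrington street, London", default_pattern)
land_spans = street_spans(
    "land to the south of barrington street, London", extended_pattern
)

for start, end, name in plot_spans + land_spans:
    print(start, end, repr(name))
```

Note that treating 'of' as a key word would also cut a street name that itself contains 'of', so whether to include it depends on the data.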
The fourth bird