I'm trying to extract address details from very ugly free text:

import regex

pat_addr_verbose = r"""(?ix)       # case-insensitive and verbose flags
(?:(?:BND|BY|CNR|OF)\W+)*         # optional leading keywords (non-capturing)
(?:(?!RD|HWY|TRAIL|St)           # negative lookahead: not a street type
(?:                              # either
(?P<n_start>\d+)-(?P<n_end>\d+)  # a number range
|(?<!-)(?P<n>\d+)                      # or a single number
)\W+)?                               # the number is optional; separators follow it
(?P<name>
(?:
(?!RD|HWY|TRAIL|St)\w+\W*)+)\W+   # words that are not street types
(?P<type>RD|HWY|TRAIL|St)*             # street type (named capturing group)
"""

pat_addr = regex.compile(pat_addr_verbose, regex.IGNORECASE | regex.VERBOSE)

text = """BND BY THOMAS RAIL TRAIL, 7 SNOW WHITE HWY & MICKEY RD,
337-343 BOGEYMAN RD, 4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St"""

regex.findall(pat_addr, text)

I'm getting the right results for simple addresses, but I fail to capture the several street numbers that belong to Fictional Rd:

[m.groupdict() for m in pat_addr.finditer(text)]

[{'n': None,
'n_end': None,
'n_start': None,
'name': 'THOMAS RAIL',
'type': 'TRAIL'},
{'n': '7',
'n_end': None,
'n_start': None,
'name': 'SNOW WHITE',
'type': 'HWY'},
{'n': None, 'n_end': None, 'n_start': None, 'name': 'MICKEY', 'type': 'RD'},
{'n': None,
'n_end': '343',
'n_start': '337',
'name': 'BOGEYMAN',
'type': 'RD'},
{'n': '4',
'n_end': None,
'n_start': None,
'name': '8, 9-13, 16-18 Fictional',
'type': 'Rd'},
{'n': '17', 'n_end': None, 'n_start': None, 'name': 'Elm', 'type': 'St'}]

I wonder if it is possible to either get a list of numbers (doesn't matter if they're not named) or a dict for them in regex?

EDIT: This is what I expect to get:

Option 1:

{'numbers': 
    [
        {
            'n': '4',
            'n_end': None,
            'n_start': None,
        },
        {
            'n': '8',
            'n_end': None,
            'n_start': None,
        },
        {
            'n': None,
            'n_end': '13',
            'n_start': '9',
        },
        {
            'n': None,
            'n_end': '18',
            'n_start': '16',
        }
    ],
'name': 'Fictional',
'type': 'Rd'},

Option 2:

    {'numbers': 
    [
        '4',
        '8',
        '9-13',
        '16-18'
    ],
'name': '8, 9-13, 16-18 Fictional',
'type': 'Rd'},
dmvianna
  • Can you post results that you'd expect to get? – Colin Oct 06 '17 at 01:31
  • @Colin, here you go. – dmvianna Oct 06 '17 at 01:51
  • you are essentially asking for [capturing an arbitrary number of groups](https://stackoverflow.com/questions/3537878/how-to-capture-an-arbitrary-number-of-groups-in-javascript-regexp/3537914#3537914), which is something regex is not capable of doing. – R Nar Oct 06 '17 at 01:53
  • @RNar, maybe not in all flavours, but the answer you refer to says it is possible in .NET and not in JavaScript. It doesn't mention Python. – dmvianna Oct 06 '17 at 01:56
  • Python is among the ones that take only the last capture – R Nar Oct 06 '17 at 14:39
  • @RNar, the `regex` module (not in the standard library) has a method to recover all captures, and the accepted answer took advantage of that. Have a look. – dmvianna Oct 06 '17 at 15:15
  • I stand corrected! I guess I should start paying attention to the new regex module... – R Nar Oct 06 '17 at 16:47

1 Answer

(?ix)                             # case insensitive and verbose flag
(?:(?:BND|BY|CNR|OF)\W+)*         # non-capturing (list)

(?:                               # number group (non-capturing) start
(?!RD|HWY|TRAIL|St)               # negative lookahead: not a street type
                                  # either
(?P<numbers>\d+-\d+|\d+)          # a number range or a single number
\W+                               # separators after the number
)                                 # number group end
*?                                # repeated lazily, so each number is captured in turn

(?P<name>
(?:
(?!RD|HWY|TRAIL|St)[A-Z]+\W*)+)\W+   # words that are not street types
(?P<type>RD|HWY|TRAIL|St)*           # street type

UPDATED WITH NEW REGEX MODULE

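The third-party `regex` module (not in the standard library) keeps every capture of a repeated group, and the match object's `captures()` method returns them as a list.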
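For contrast, here is a minimal sketch of the limitation this works around: with the standard library's `re`, a repeated group retains only its final repetition. The pattern and the `num` group name are simplified for illustration and are not part of the regex in this answer.

```python
import re

# A capturing group inside a repetition: every pass overwrites the previous capture.
m = re.match(r'(?:(?P<num>\d+(?:-\d+)?)\W+)*', '4, 8, 9-13, 16-18 Fictional Rd')

# Only the last repetition is still available after the match.
print(m.group('num'))  # 16-18
```

With `regex`, the same kind of group exposes all four values through `captures()`, which is what the code that follows relies on.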

import regex

text = 'BND BY THOMAS RAIL TRAIL, 7 SNOW WHITE HWY & MICKEY RD, 337-343 BOGEYMAN RD, 4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St'
reg = r'(?ix)(?:(?:BND|BY|CNR|OF)\W+)*(?:(?!RD|HWY|TRAIL|St)(?P<numbers>\d+-\d+|\d+)\W+)*?(?P<name>(?:(?!RD|HWY|TRAIL|St)[A-Z]+\W*)+)\W+(?P<type>RD|HWY|TRAIL|St)*'


def updateD(m):
    # groupdict() holds only the last capture of a repeated group,
    # so overwrite 'numbers' with the full list from captures().
    d = m.groupdict()
    d['numbers'] = m.captures('numbers')
    return d

[updateD(m) for m in regex.finditer(reg, text)]

OUTPUT

[
  {
   'numbers': [],
   'name': 'THOMAS RAIL',
   'type': 'TRAIL'
  }, 
  {
   'numbers': ['7'],
   'name': 'SNOW WHITE',
   'type': 'HWY'
  }, 
  {
   'numbers': [],
   'name': 'MICKEY',
   'type': 'RD'
  }, 
  {
   'numbers': ['337-343'],
   'name': 'BOGEYMAN',
   'type': 'RD'
  }, 
  {
   'numbers': ['4', '8', '9-13', '16-18'],
   'name': 'Fictional',
   'type': 'Rd'
  }, 
  {
   'numbers': ['17'],
   'name': 'Elm',
   'type': 'St'
  }
]
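
If installing `regex` is not an option, a similar result can probably be had with the standard `re` module in two passes, along the lines of the "second pass" mentioned in the comments: capture the whole run of numbers as one group, then split it. This is only a sketch and not part of the answer's tested code; `STREET_TYPES`, `ADDR` and `parse_addresses` are names made up here, and the pattern is a simplified adaptation of the one above.

```python
import re

STREET_TYPES = r"RD|HWY|TRAIL|St"

# Pass 1: capture the entire run of street numbers (with separators) as one group.
ADDR = re.compile(
    r"(?:(?:BND|BY|CNR|OF)\W+)*"
    r"(?P<numbers>(?:\d+(?:-\d+)?\W+)*?)"
    r"(?P<name>(?:(?!" + STREET_TYPES + r")[A-Z]+\W*)+)\W+"
    r"(?P<type>" + STREET_TYPES + r")*",
    re.IGNORECASE,
)


def parse_addresses(text):
    """Return one dict per address, with 'numbers' as a list of strings."""
    results = []
    for m in ADDR.finditer(text):
        d = m.groupdict()
        # Pass 2: split the captured run into single numbers and ranges.
        d["numbers"] = re.findall(r"\d+(?:-\d+)?", d["numbers"])
        results.append(d)
    return results


print(parse_addresses("4, 8, 9-13, 16-18 Fictional Rd & 17 Elm St"))
```

The trade-off is the extra `findall` step per match; the single-regex version with `captures()` avoids it.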
kaza
  • I made an edit specifying the expected result, please have a look. Thank you, however. I hadn't thought of getting the whole sequence. That would allow a second pass (but I would rather avoid it if possible). – dmvianna Oct 06 '17 at 01:49
  • @dmvianna Just a bit confused, as you introduced a new field `numbers`. Does that mean you want it to be a mainstay in all entries? I have a slightly different version; see if it works for you. You can extend it from there. Mind you, I'm using the new regex module for the first time. – kaza Oct 06 '17 at 02:28
  • Thanks for your answer. Yes, that’s the desired result. It would be great to get it using a single regex, but I’ll use more steps if necessary. Your answer provides a good first pass. – dmvianna Oct 06 '17 at 02:31
  • @dmvianna just updated the answer, not happy the way it looks though :-( – kaza Oct 06 '17 at 02:44
  • @dmvianna found a better way to do this! Hopefully this would cater to your needs! – kaza Oct 06 '17 at 03:34
  • @dmvianna see the latest as per your OP. – kaza Oct 06 '17 at 06:32
  • I think you should remove the references to past edits from your answer, as they will not be relevant for future readers. Well done! – dmvianna Oct 06 '17 at 15:19