I am trying to isolate street address fields that begin with a digit, contain an underscore and end with a comma:

001 ALLAN Witham Ross 13 Every_Street, Welltown Greenkeeper
002 ALLARDYCE Margaret Isabel 49 Bell_Road, Musicville Housewife
003 ALLARDYCE Mervyn George 49 Bell_Road, Musicville Company Mngr

e.g.

13 Every_Street, Welltown
49 Bell_Road, Musicville
49 Bell_Road, Musicville

My regex is

(?ms)([0-9]+\s[A-Z][a-z].+(?=,))

But this matches from 13 through to the last 'd' of the final Bell_Road, which is almost everything. See regex101 example

This matches two commas but not the third? I want it to match up to the next comma. But do it three times :)

Dave
  • But you would have incomplete matches then, like `13 Every_Street`. Try `\d+\s+[A-Z][a-z][^,]*,\s+\S+`, see https://regex101.com/r/JdOb6e/2 – Wiktor Stribiżew Dec 05 '22 at 20:06
  • Maybe `(\d+ \S+_\S+, \S+)` ? https://regex101.com/r/rcVH7q/1 – Andrej Kesely Dec 05 '22 at 20:11
  • I like Andrej's example for its simplicity. The way I see it is his regex layout imitates the structure of the search. Building in spaces and leaving most of it down to visible characters. I haven't used `\S` before. – Dave Dec 05 '22 at 22:41
  • I appreciate Wiktor is finishing off the regex that I started. But I find this part of his regex `[^,]*,\s+\S+` hard to put into words :) – Dave Dec 05 '22 at 22:46

2 Answers

You don't have to assert the comma to the right if you also want to match it.
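To see why the original pattern overshoots, here is a minimal sketch in Python. It assumes the sample records from the question sit one per line. With the `s` flag the greedy `.+` runs to the end of the text, then backtracks only as far as the last comma, so the lookahead yields a single long match. The second pattern, `[0-9]+\s[A-Z][a-z][^,]*,`, is an illustrative variant (not from the question) that consumes the comma and cannot run past it.

```python
import re

# Sample records from the question, assumed here to be one per line.
text = (
    "001 ALLAN Witham Ross 13 Every_Street, Welltown Greenkeeper\n"
    "002 ALLARDYCE Margaret Isabel 49 Bell_Road, Musicville Housewife\n"
    "003 ALLARDYCE Mervyn George 49 Bell_Road, Musicville Company Mngr"
)

# Original pattern: the greedy .+ (with the s flag) swallows the rest of the
# text and backtracks only until the lookahead sees a comma, i.e. the last one.
greedy = re.findall(r"(?ms)([0-9]+\s[A-Z][a-z].+(?=,))", text)

# Illustrative variant: [^,]* cannot cross a comma, and the comma is consumed.
bounded = re.findall(r"[0-9]+\s[A-Z][a-z][^,]*,", text)

print(len(greedy))  # one long match instead of three
print(bounded)
```

The negated character class `[^,]*` is what keeps each match inside a single address; a lazy `.+?` would be another option.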

If you want to match an underscore before the comma, and the address part itself cannot contain a comma:

\b\d+\s+[A-Z][a-z][^_,]*_[^,]+,\s+\S+

Explanation

  • \b A word boundary
  • \d+ Match 1+ digits
  • \s+ Match 1+ whitespace chars
  • [A-Z][a-z] Match an uppercase char A-Z and a lowercase char a-z
  • [^_,]*_ Optionally match any char except _ or , and then match _
  • [^,]+, Match 1+ chars except , and then match ,
  • \s+\S+ Match 1+ whitespace chars followed by 1+ non whitespace chars
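As a quick check, the pattern can be run against the sample from the question. This is a small sketch in Python, assuming one record per line. The last line uses a made-up address without an underscore (`7 Plain Lane, Nowhere`, not from the question) to show that such input is skipped.

```python
import re

# Sample records from the question, assumed here to be one per line.
text = (
    "001 ALLAN Witham Ross 13 Every_Street, Welltown Greenkeeper\n"
    "002 ALLARDYCE Margaret Isabel 49 Bell_Road, Musicville Housewife\n"
    "003 ALLARDYCE Mervyn George 49 Bell_Road, Musicville Company Mngr"
)

pattern = re.compile(r"\b\d+\s+[A-Z][a-z][^_,]*_[^,]+,\s+\S+")

addresses = pattern.findall(text)
for address in addresses:
    print(address)

# Hypothetical input without an underscore: the pattern finds nothing.
no_underscore = pattern.findall("7 Plain Lane, Nowhere")
print(no_underscore)
```

The leading record numbers (`001`, `002`, `003`) are not picked up because they are followed by an all-caps surname, which fails the `[A-Z][a-z]` step.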

Regex demo

The fourth bird

This produces your desired matches:
\d+[^,\d]*_[^,]+, \S+
demo

They don't end with a comma, though.
For that you could just remove the space and \S+ at the end.
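A short Python sketch of both variants, assuming the sample records from the question sit one per line. The trimmed pattern drops the space before `\S+` as well, so that each match ends exactly on the comma.

```python
import re

# Sample records from the question, assumed here to be one per line.
text = (
    "001 ALLAN Witham Ross 13 Every_Street, Welltown Greenkeeper\n"
    "002 ALLARDYCE Margaret Isabel 49 Bell_Road, Musicville Housewife\n"
    "003 ALLARDYCE Mervyn George 49 Bell_Road, Musicville Company Mngr"
)

# Full pattern: house number, street containing an underscore, comma, town.
with_town = re.findall(r"\d+[^,\d]*_[^,]+, \S+", text)

# Trimmed pattern: stops at the comma (the space and \S+ are both removed).
up_to_comma = re.findall(r"\d+[^,\d]*_[^,]+,", text)

print(with_town)
print(up_to_comma)
```

Excluding digits in `[^,\d]*` is what stops the record numbers from starting a match: scanning from `001` hits the `1` of `13` before any underscore is found.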

Marty