I'm trying to write a regular expression for parsing addresses.

My address strings have the following format: AddressLine (City) SomeFixedText

The only problem is that the (City) part is optional. For example:

Abingdon Road (Oxford) Water Network Pumping
Adderbury Water Network Pumping

Ignoring the optional group, I came up with the following expression:

(?<Line>.*) \((?<City>.*)\) Water Network Pumping

But of course it doesn't handle the second case. I tried to use an optional group:

(?<Line>.*)( \((?<City>.*)\))? Water Network Pumping

But in this case Line and City are captured incorrectly for the first case: Line captures "Abingdon Road (Oxford)" and City stays empty.
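
A minimal reproduction of the problem, sketched in Python rather than .NET. Python spells named groups `(?P<Name>...)` where .NET uses `(?<Name>...)`; otherwise the pattern is the same as the second one above. The helper name `parse_greedy` is made up for this illustration.

```python
import re

# Same pattern as the optional-group attempt above, with Python's
# (?P<Name>...) spelling for named groups.
GREEDY = re.compile(r"(?P<Line>.*)( \((?P<City>.*)\))? Water Network Pumping")


def parse_greedy(address):
    """Return (Line, City) for an address, or None if it does not match."""
    m = GREEDY.match(address)
    if m is None:
        return None
    return m.group("Line"), m.group("City")


if __name__ == "__main__":
    for address in (
        "Abingdon Road (Oxford) Water Network Pumping",
        "Adderbury Water Network Pumping",
    ):
        print(parse_greedy(address))
    # ('Abingdon Road (Oxford)', None)
    # ('Adderbury', None)
```

Both strings match, but in the first one the city ends up inside Line and the City group does not participate in the match at all.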

Here is the fiddle: https://dotnetfiddle.net/crHlta

How can I handle this situation?

Roman Koliada
  • Use `(?<Line>.*?)(?:\s+\((?<City>.*?)\))? Water Network Pumping` – Wiktor Stribiżew Feb 06 '20 at 12:01
  • @WiktorStribiżew I disagree with the decision to close this question on the grounds that a bit of targeted advice for the OP would better help him learn what went wrong – Caius Jard Feb 06 '20 at 12:03
  • @RomanKolida in essence the problem (fixed by the tiny change of `.*` -> `.*?`) is that `.*` matches as much of the input as it can, including the city name in brackets. Then your optional clause asks "is there a city left in the data, in brackets?" Even when the input has one, the answer is no, because the greedy `.*` has already consumed the city too. Because the city is allowed to be absent, the match still succeeds, even though the city was captured into the wrong group. By making the matching creep forwards (lazy) rather than greedy (consume everything and work backwards), `.*?` matches only up to the first bracket – Caius Jard Feb 06 '20 at 12:06
  • You could also manage it by making the first quantifier "match everything that is not a bracket", like `(?<Line>[^()]*)(?:\s+\((?<City>.*?)\))? Water Network Pumping` - ultimately, if you're going to have "optional things that follow mandatory things" you have to make sure that the "mandatory thing" doesn't match too much and eat up all your optional things – Caius Jard Feb 06 '20 at 12:07
  • You could use this: `((?<Line>.*?)\((?<City>.*?)\)\sWater Network Pumping|(?<Line>.*?)\sWater Network Pumping)` – Jeppe Spanggaard Feb 06 '20 at 12:09
  • The only problem here is that `.*` was used instead of `.*?`. Hence, the duplicates are appropriate. – Wiktor Stribiżew Feb 06 '20 at 12:15
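
The two fixes suggested in the comments (a lazy `.*?` and a "no brackets" character class) can be sketched as follows. This is an illustration in Python, not the .NET code from the fiddle: named groups are written `(?P<Name>...)` instead of `(?<Name>...)`, and the helper names `parse_lazy` and `parse_no_brackets` are invented for the example.

```python
import re

SUFFIX = r" Water Network Pumping"

# Fix 1: lazy quantifiers, so Line grows one character at a time and the
# optional city group gets a chance to match before Line swallows it.
LAZY = re.compile(r"(?P<Line>.*?)(?:\s+\((?P<City>.*?)\))?" + SUFFIX)

# Fix 2: Line may not contain brackets at all, so it cannot eat the city.
NO_BRACKETS = re.compile(r"(?P<Line>[^()]*)(?:\s+\((?P<City>.*?)\))?" + SUFFIX)


def _parse(pattern, address):
    """Return (Line, City) or None; City is None when the group is absent."""
    m = pattern.match(address)
    if m is None:
        return None
    return m.group("Line"), m.group("City")


def parse_lazy(address):
    return _parse(LAZY, address)


def parse_no_brackets(address):
    return _parse(NO_BRACKETS, address)


if __name__ == "__main__":
    for address in (
        "Abingdon Road (Oxford) Water Network Pumping",
        "Adderbury Water Network Pumping",
    ):
        print(parse_lazy(address), parse_no_brackets(address))
    # ('Abingdon Road', 'Oxford') ('Abingdon Road', 'Oxford')
    # ('Adderbury', None) ('Adderbury', None)
```

Both variants give the same result on the two sample strings. The character-class version is stricter: it will not match an address line that itself contains brackets, which may or may not be what you want for your data.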

0 Answers