18

My regex needs to parse an address which looks like this:

BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
          1            2       3      4*

Groups one, two and three will always exist in an address; group 4 may not. I've written a regex that captures the first, second and third parts, but I also need the fourth. Part 4 is the country name and can be either FINLAND or SUOMI. If the fourth part is missing from an address, the fourth group should be empty. This is my regex so far, but its third group captures the country too. Any help?

(.*?)\s(\d{5})\s(.*)$

(I'm going to be using this with Oracle's REGEXP functions.)
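
To make the problem concrete, here is a minimal sketch that runs the pattern above against the sample address. It uses Python's `re` module purely for illustration (the target here is Oracle, whose regex flavour may differ in details), and the names `QUESTION_PATTERN` and `question_groups` are made up for this example.

```python
import re

# The pattern from the question: the last group is a greedy (.*) that
# runs to the end of the string, so it takes the country with it.
QUESTION_PATTERN = re.compile(r"(.*?)\s(\d{5})\s(.*)$")

address = "BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"
question_groups = QUESTION_PATTERN.match(address).groups()

print(question_groups)
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI SUOMI')
```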

BenMorel
Mridang Agarwalla
  • What exactly is allowed as content for the groups? May group 4 or 5 contain whitespace, for example? Is group 2 always 5 characters long? – Tim Pietzcker Jul 12 '11 at 12:25

5 Answers

16

Change the regex to:

(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$

Making group three non-greedy lets the pattern match the optional space and country. If group 4 doesn't match, I think it will be uninitialized rather than blank, but that depends on the language.
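
A quick way to check this is to run the pattern with and without a country. The sketch below uses Python's `re` module for illustration only; `ANSWER_PATTERN` and `split_address` are hypothetical names, and Oracle's regex functions may behave differently in details.

```python
import re

# The pattern from this answer: group 3 is lazy, group 4 is optional.
ANSWER_PATTERN = re.compile(r"(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$")


def split_address(address):
    """Return the four captured groups, or None if the address does not match."""
    match = ANSWER_PATTERN.match(address)
    return match.groups() if match else None


with_country = split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI")
without_country = split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI")

print(with_country)
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(without_country)
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', None)
```

In Python the unmatched group comes back as `None` rather than an empty string, which is the "uninitialized rather than blank" case. Oracle treats an empty string as NULL anyway, so the distinction probably matters less there.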

NorthGuard
12

To match a character (or, in your case, a group) that may or may not exist, put ? after the character, subpattern, or class in question. I'm adding this answer because regex is complicated and deserves an explanation: posting only the fix without explaining it isn't enough!

A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.

Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
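
The `colou?r` example from the quote can be tried directly. This is a small sketch in Python (chosen only for illustration; `OPTIONAL_U` and `optional_results` are made-up names):

```python
import re

# "u?" means zero or one "u", so both spellings match in full,
# but a doubled "u" does not.
OPTIONAL_U = re.compile(r"colou?r")

optional_results = {
    word: bool(OPTIONAL_U.fullmatch(word))
    for word in ("color", "colour", "colouur")
}

print(optional_results)
# {'color': True, 'colour': True, 'colouur': False}
```

The same `?` works after a parenthesised group, which is how the country part of the address can be made optional.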

Luke Madhanga
2

Try this:

(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$
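
It is worth trying this pattern on an address without a country as well. The sketch below does that in Python (an assumption for illustration; Oracle may differ, and `GENERIC_PATTERN` is a made-up name). Because group 3 is lazy and group 4 accepts any run of non-whitespace, a lone city seems to land in group 4 instead of group 3.

```python
import re

# The pattern from this answer: group 4 accepts any non-whitespace run.
GENERIC_PATTERN = re.compile(r"(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$")

generic_with = GENERIC_PATTERN.match(
    "BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"
).groups()
generic_without = GENERIC_PATTERN.match(
    "BLOOKKOKATU 20 A 773 00810 HELSINKI"
).groups()

print(generic_with)
# ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(generic_without)
# ('BLOOKKOKATU 20 A 773', '00810', '', 'HELSINKI')
```

Restricting group 4 to the two known country names avoids this ambiguity.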
Jonathan Hall
  • `(.*)\s?(.*)?` -> `(.*)` will match all the way to the end, so `\s?(.*)?` will never match anything. The real problem here is ambiguity - how do you correctly split `New York New York`? (reading more carefully, there are only two options) – Kobi Jul 12 '11 at 12:24
  • I don't think there's a New York New York in either Finland or Suomi. But it's a valid point... it's impossible to know where a multi-word city and a multi-word country name are separated if they're only space-delimited. But given we're told there are only two possible country names, and neither contains a space, I can update the RE accordingly. – Jonathan Hall Jul 12 '11 at 12:31
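
The greediness point from the first comment can be seen in isolation. A minimal sketch in Python (for illustration only; `GREEDY_TAIL` and `greedy_groups` are made-up names):

```python
import re

# A greedy (.*) followed by an optional tail: the first group
# swallows the whole string and leaves nothing for the second.
GREEDY_TAIL = re.compile(r"(.*)\s?(.*)?$")

greedy_groups = GREEDY_TAIL.match("HELSINKI SUOMI").groups()

print(greedy_groups[0])        # HELSINKI SUOMI
print(bool(greedy_groups[1]))  # False: the second group captured nothing
```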
0

This will match your input more tightly and each of your groups is in its own regex group:

(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)

or, if a literal space is acceptable instead of any whitespace:

(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
  • Group 1: BLOOKKOKATU 20 A 773
  • Group 2: 00810
  • Group 3: HELSINKI
  • Group 4: SUOMI (may be empty, although the whitespace before it is still required)
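
A short check of how group 4 behaves, sketched in Python for illustration (Oracle may differ in details). `TIGHT_PATTERN` is the first pattern above; `LENIENT_PATTERN` is a hypothetical variant, not part of this answer, that makes the last separator optional with `\s?` so that an address without a country still matches.

```python
import re

# The first pattern from this answer.
TIGHT_PATTERN = re.compile(r"(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)")
# Hypothetical variant (an assumption): the final separator is optional.
LENIENT_PATTERN = re.compile(r"(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s?(\w*)")


def groups_or_none(pattern, address):
    """Return the captured groups, or None when the pattern does not match."""
    match = pattern.match(address)
    return match.groups() if match else None


full = "BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"
short = "BLOOKKOKATU 20 A 773 00810 HELSINKI"

tight_full = groups_or_none(TIGHT_PATTERN, full)
tight_short = groups_or_none(TIGHT_PATTERN, short)
lenient_full = groups_or_none(LENIENT_PATTERN, full)
lenient_short = groups_or_none(LENIENT_PATTERN, short)

print(tight_full)     # ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(tight_short)    # None
print(lenient_full)   # ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(lenient_short)  # ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', '')
```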
Bohemian
0

(.*?)\s(\d{5})\s(\w+)\s(\w*)

An example:

    with t as
    ( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
    )
    select text
         , regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
      from t
    /


TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI


1 row selected.
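
For readers without an Oracle instance at hand, the same substitution can be approximated in Python with `re.sub`. This is only a sketch under that assumption (`ROB_PATTERN` and the variable names are made up), and it also shows what happens when the country is missing: the pattern does not match, so the text comes back unchanged.

```python
import re

# Same pattern and replacement as the regexp_replace call above.
ROB_PATTERN = r"(.*?)\s(\d{5})\s(\w+)\s(\w*)"
REPLACEMENT = r"\1**\2**\3**\4"

new_text = re.sub(
    ROB_PATTERN, REPLACEMENT, "BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"
)
unchanged_text = re.sub(
    ROB_PATTERN, REPLACEMENT, "BLOOKKOKATU 20 A 773 00810 HELSINKI"
)

print(new_text)
# BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
print(unchanged_text)
# BLOOKKOKATU 20 A 773 00810 HELSINKI
```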

Regards,
Rob.

Rob van Wijk