Regex (Posix) to get first word only, not including numbers

Question

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".

I would like to get the street name (first non-numeric word) from an address.

e.g.

101 Main Street = Main

2/b Pleasant Ave = Pleasant

5H Unpleasant Crescent = Unpleasant

I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).

Thanks

Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.

A plain English statement of the requirement would be:

Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters

Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.

Additional examples (original ones at top are still valid)

190 - 192 Tweety-bird avenue = Tweety-bird

190-192 Tweety-bird avenue = Tweety-bird

Charles Bronson Place = Charles

190H Charles-Bronson Place = Charles-Bronson

190 to 192 Charles Bronson Place = Charles

Second Edit: Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:

[a-zA-Z]+(?:[\s-][a-zA-Z]+)*

which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).

Sample SQL:

select HAD1,                                        
 regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH                                            
where HEDTE > 20190601

Sample output

Ship To                                             REGEXP_SUBSTR      
Address                                                                
Line 1                                                                 
32 CHRISTOPHER STREET                               CHRISTOPHER STREET 
250 - 270 FEATHERSTON STREET                        FEATHERSTON STREET 
118 MONTREAL STREET                                 MONTREAL STREET    
7 BIRMINGHAM STREET                                 BIRMINGHAM STREET  
59 MORRISON DRIVE                                   MORRISON DRIVE     
118 MONTREAL STREET                                 MONTREAL STREET    
MASON ROAD                                          MASON ROAD

I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.

The street's name might not always be one word though. Are you sure you just want the first word? — Sweeper, Jun 28 '19 at 13:08
True, but the first word will suffice in the vast majority of cases since I'll be combining it with other information to make a unique value. If I wanted the whole multi-word street name I presume I would need to know (and state) every possible street name type (avenue, close, promenade, etc.) to know when the street name had ended. — , Jun 28 '19 at 13:30

score 0 · Answer 1 · answered Jun 28 '19 at 14:51

0

Not sure if this is Posix compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$, example here.

The script assumes that the first chunk is the number of the building, the second chunk, is the name of the street, and the last chunk is Road/Ave/Blvd/etc.

This should also cater for street names which have white spaces in them.

answered Jun 28 '19 at 14:51

npinti

51,780
5
72
96

Alas, no - not compliant with my particular implementation. Thanks for trying though. – Jun 29 '19 at 00:03

score -1 · Answer 2 · answered Jun 28 '19 at 13:19

-1

Using the following regex matches your examples :

(?<=[^ ]+ )[^ ]*[ ]

answered Jun 28 '19 at 13:19

nAviD

2,784
1
33
54

1

Thank you for your reply, but that crashes the engine. Probably IBM has messed around with POSIX in this particular implementation. I can't help noticing they use the weasel phrase "based on". Removing ?<= does get a result: the number and name (i.e. the first two words), so perhaps we're on the right track. – Jun 28 '19 at 13:36
@Unmutual sorry to here that. Does it supports split? why not split(' ') and then get the index 1 of result? – nAviD Jun 28 '19 at 13:40
Not split, no. There are several string manipulators in DB2 SQL, and it's relatively simple to get the positions of the first and second blank and extract the string between them, but then it gets more complicated if, e.g. there is no street number. I thought this might be a good simple regex task to cut my teeth on. Perhaps I picked a bad first example. – Jun 28 '19 at 13:56

Regex (Posix) to get first word only, not including numbers

2 Answers2