1

I am trying to parse the "name" and "address" from a string. I have written the regex pattern which works perfectly fine (I verified in regex101.com) with the 'ungreedy/lazy' feature of regex but not with the greedy. Here is my snowflake query:

select
TRIM(REGEXP_SUBSTR(column1,'(^\\D*)((\\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\\d+.*)|(\\d+.*))[,.\\s]+([a-zA-Z]{2})[,.\\s]+(\\d{5}|\\d{5}-\\d{4})$',1,1,'is',1)) as test
from values(TRIM('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')); 

--please ignore the latter part of regex as I am fetching territory code and zip code also and they are working fine.

The above query is returning me "FIRST SECOND THIRD PO BOX" And, if I return the 2nd group it returns me "123 DUMMY"

What I want:

case 1 - when my string is 'FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'

output of 1st group: "FIRST SECOND THIRD"

output of 2nd group: "PO BOX 123 DUMMY"

case 2 - WHEN my string is 'FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789'

output of 1st group: "FIRST SECOND THIRD FOURTH FIFTH"

output of 2nd group: "123 DUMMY"

Please suggest workaround here in snowflake since it doesn't have lazy feature.

PS. If you want to verify in regex101, paste the below code and test string. You will see the result when you switch to Ungreedy. (^\D*)((\bP[OST][ .]O[FFICE][ .]B[OX][ .]\d+.)|(\d+.))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$

Test String: FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789

Thanks

eshirvana
  • 23,227
  • 3
  • 22
  • 38
  • If your Group 1 data cannot contain `P`, you could use `^([^0-9P]*)(\d.*|P[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d.*)[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$` – Wiktor Stribiżew Apr 27 '21 at 19:08
  • Thanks for looking into this Wiktor. Unfortunately, I cannot restrict character P in the 1st group. As I mentioned I am trying to parse Name and Address fields here, the 1st group which is Name can contain the Character P of "Post Office Box". – pratyush kumar Apr 28 '21 at 05:14

1 Answers1

2

Writing a JavaScrip UDF is always an option, and then you can use your regex unchanged:

create or replace function parse_address(F STRING)
    returns VARIANT
    language JAVASCRIPT
    immutable
    as $$
    const regex = /(^\D*)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$/gm;
    let m = regex.exec(F);
    return [m[1], m[2]];
$$;

Usage:

select parse_address($1)
from values('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')
    , ('FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789')
;

enter image description here

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Hi Felipe, thanks much for your response. This is not the expected response. PO BOX is coming in the name instead of address. I am very new to Javascript and struggling a lot to write my required Javascript UDTF. Could you please write a sample for me where I need to execute a SQL query and return the result in a TABLE. I already tried reading lots of documentations and tried with "processRow: function get_params(row, rowWriter, context)", I am able to return TABLE result but not sure how to execute a select query along with this. If possible please write a sample for me. Thanks much. Cheers! – pratyush kumar Apr 30 '21 at 10:31
  • Hi Pratush - if the results are not correct, that's on the regex - I used the exact one you posted. To use the results as a table, in SQL you can ask for elements of the array, like in `parse_address($1)[1]`. The best news with JavaScript: Even if you can't come up with a regex that will solve all problems, you can also write custom code to split the addresses (which gives you more power than a regex). To make the regex in JS ungreedy, check https://stackoverflow.com/questions/364014/is-there-a-way-to-use-ungreedy-matching-in-javascript-for-regular-expressions – Felipe Hoffa Apr 30 '21 at 17:04