Regular Expression Match between occurrence of character

Question

I have the following string:

3#White House, District Of Columbia, United States#US#USDC#DC001#38.8951#-77.0364#531871#382

as you can see, the string is delimited by #'s. My use-case resembles a simple SPLIT(string,"#") operation but regex gives me a bit more flexibility.

I would like to match the characters between two occurrences of #'s. for example the characters between the second and third occurrence should match: 'US'

I'm using Google Bigquery and was able to match the first two terms of the string but struggle with the third:

REGEXP_EXTRACT(locations,r'^\d') as location_type,    
REGEXP_REPLACE(REGEXP_EXTRACT(locations,r'^\d#.*?#'),r'^\d*#|#','') as location_full_name, 
????

locations are strings such as the one above.

I've found this question but I have multiple delimeters and would like to specify between which occurences the match should take place e.g. 2 and 5th occurrence.

score 3 · Accepted Answer · answered Aug 14 '18 at 13:19

3

You may use a regex like ^(?:[^#]*#){N}([^#]*) where N is the number of your required substring minus 1. To get US, which is the third value, you may use

^(?:[^#]*#){2}([^#]*)

See the regex demo

Details

^ - start of string
(?:[^#]*#){2} - two sequences of
- [^#]* - any zero or more chars other than #
- # - a # char
([^#]*) - Capturing group 1: any zero or more chars other than #.

answered Aug 14 '18 at 13:19

Wiktor Stribiżew

607,720
39
448
563

wow, this is very powerful. How would you, using this regex, match between the 2nd and 5th occurrence? Can probably figure out the rest myself. – Rutger Hofste Aug 14 '18 at 13:23
1

@RutgerHofste If you want to match them inclusively, [`^(?:[^#]*#)([^#]*(?:#[^#]*){3})`](https://regex101.com/r/G5J5wg/2) – Wiktor Stribiżew Aug 14 '18 at 13:24
The expected result would be "US#USDC#DC001#38.8951" (all characters including delimiter between the 2nd and 5th occurrence of the delimiter) – Rutger Hofste Aug 14 '18 at 13:29
1

@RutgerHofste Then use `{2}` limiting quantifier after the non-capturing group, [`^(?:[^#]*#){2}([^#]*(?:#[^#]*){3})`](https://regex101.com/r/G5J5wg/3) – Wiktor Stribiżew Aug 14 '18 at 13:31

score 1 · Answer 2 · answered Aug 14 '18 at 15:14

My use-case resembles a simple SPLIT(string,"#") operation but regex gives me a bit more flexibility

Obviously REGEXP_EXTRACT() is the way to go here - but wanted to throw different option to show flexibility in using split too - just one of an option

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '3#White House, District Of Columbia, United States#US#USDC#DC001#38.8951#-77.0364#531871#382' locations
)
SELECT 
  REGEXP_EXTRACT(locations, r'^(?:[^#]*#){2}([^#]*(?:#[^#]*){3})') value_via_regexp,
  (SELECT STRING_AGG(part, '#' ORDER BY pos) FROM UNNEST(SPLIT(locations, '#')) part WITH OFFSET pos WHERE pos BETWEEN 2 AND 5) value_via_split_unnest
FROM `project.dataset.table`

with result as

Row     value_via_regexp            value_via_split_unnest   
1       US#USDC#DC001#38.8951       US#USDC#DC001#38.8951

Hi i had something of a similar conundrum. I am using a BQ query and my dataset has the info in 1 of the columns like record 1 : "A.B.C.D" Record 2 : "P.Q.R" Record 3 : "X.Y" Record 4 : "L.M.N" and i have to get the values except the dots in separate columns. Note that there cud be varying number of dots inside 1 column. Struggling with this to do it in 1 shot in a query. — Nik M., Apr 24 '20 at 22:02

Regular Expression Match between occurrence of character

2 Answers2

Linked

Related