2

I have the following string:

3#White House, District Of Columbia, United States#US#USDC#DC001#38.8951#-77.0364#531871#382

as you can see, the string is delimited by #'s. My use-case resembles a simple SPLIT(string,"#") operation but regex gives me a bit more flexibility.

I would like to match the characters between two occurrences of #'s. for example the characters between the second and third occurrence should match: 'US'

I'm using Google Bigquery and was able to match the first two terms of the string but struggle with the third:

REGEXP_EXTRACT(locations,r'^\d') as location_type,    
REGEXP_REPLACE(REGEXP_EXTRACT(locations,r'^\d#.*?#'),r'^\d*#|#','') as location_full_name, 
????

locations are strings such as the one above.

I've found this question but I have multiple delimeters and would like to specify between which occurences the match should take place e.g. 2 and 5th occurrence.

Rutger Hofste
  • 4,073
  • 3
  • 33
  • 44

2 Answers2

3

You may use a regex like ^(?:[^#]*#){N}([^#]*) where N is the number of your required substring minus 1. To get US, which is the third value, you may use

^(?:[^#]*#){2}([^#]*)

See the regex demo

Details

  • ^ - start of string
  • (?:[^#]*#){2} - two sequences of
    • [^#]* - any zero or more chars other than #
    • # - a # char
  • ([^#]*) - Capturing group 1: any zero or more chars other than #.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • wow, this is very powerful. How would you, using this regex, match between the 2nd and 5th occurrence? Can probably figure out the rest myself. – Rutger Hofste Aug 14 '18 at 13:23
  • 1
    @RutgerHofste If you want to match them inclusively, [`^(?:[^#]*#)([^#]*(?:#[^#]*){3})`](https://regex101.com/r/G5J5wg/2) – Wiktor Stribiżew Aug 14 '18 at 13:24
  • The expected result would be "US#USDC#DC001#38.8951" (all characters including delimiter between the 2nd and 5th occurrence of the delimiter) – Rutger Hofste Aug 14 '18 at 13:29
  • 1
    @RutgerHofste Then use `{2}` limiting quantifier after the non-capturing group, [`^(?:[^#]*#){2}([^#]*(?:#[^#]*){3})`](https://regex101.com/r/G5J5wg/3) – Wiktor Stribiżew Aug 14 '18 at 13:31
1

My use-case resembles a simple SPLIT(string,"#") operation but regex gives me a bit more flexibility

Obviously REGEXP_EXTRACT() is the way to go here - but wanted to throw different option to show flexibility in using split too - just one of an option

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '3#White House, District Of Columbia, United States#US#USDC#DC001#38.8951#-77.0364#531871#382' locations
)
SELECT 
  REGEXP_EXTRACT(locations, r'^(?:[^#]*#){2}([^#]*(?:#[^#]*){3})') value_via_regexp,
  (SELECT STRING_AGG(part, '#' ORDER BY pos) FROM UNNEST(SPLIT(locations, '#')) part WITH OFFSET pos WHERE pos BETWEEN 2 AND 5) value_via_split_unnest
FROM `project.dataset.table`      

with result as

Row     value_via_regexp            value_via_split_unnest   
1       US#USDC#DC001#38.8951       US#USDC#DC001#38.8951    
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Hi i had something of a similar conundrum. I am using a BQ query and my dataset has the info in 1 of the columns like record 1 : "A.B.C.D" Record 2 : "P.Q.R" Record 3 : "X.Y" Record 4 : "L.M.N" and i have to get the values except the dots in separate columns. Note that there cud be varying number of dots inside 1 column. Struggling with this to do it in 1 shot in a query. – Nik M. Apr 24 '20 at 22:02