2

(This is an extension to this question, but my reputation is too low to comment or ask more questions on that topic...)

We work on bigquery, hence limited in importing packages or using other languages. And, as per the link above, js is a solution, but not what I'm looking for here. I implemented it in js, and it was too slow for our needs.

Suppose one of our columns is a string that look like this (array of json):
[{"location":[22.99902,66.000],"t":1},{"location":[55.32168,140.556],"t":2},{"location":[85.0002,20.0055],"t":3}]

I want to extract from the column the json for which "t":2

Where:

  • some columns don't have elements "t":2
  • Some columns have several elements "t":2
  • The number of json elements in each string can change
  • element "t":2 is not always in second position.

I don't know regexp well enough for this. We tried regexp_extract with this pattern: r'(\{.*?\"t\":2.*?\})')), but that doesn't work. It extracts everything that precedes "t":2, including the json for "t":2. We only want the json of element "t":2.

Could you advise a regexp pattern that would work?

EDIT:

I have a preference for a solution that gives me 1 match. Suppose I have this string: [{"location":[22.99902,66.000],"t":1},{"location":[55.32168,140.556],"t":2},{"location":[55.33,141.785],"t":2}],
I would prefer receiving only 1 answer, the first one.

In that case perhaps regexp is less appropriate, but I'm really not sure?

Jimmy G
  • 23
  • 4

4 Answers4

1

How about this:

(?<=\{)(?=.*?\"t\"\s*:\s*2).*?(?=\})

As seen here

Robo Mop
  • 3,485
  • 1
  • 10
  • 23
  • That works well it seems with the example above. For some strange reason, doesn't work as well on my data, which is a bit more complex. Work in progress. I upvoted your question, but low reputation ... One more question: any way you can return the 1st instance found? Not just all of them? Thanks – Jimmy G Feb 16 '20 at 15:58
0

May 1st, 2020 Update

A new function, JSON_EXTRACT_ARRAY, has been just added to the list of JSON functions. This function allows you to extract the contents of a JSON document as a string array.

so in below you can replace use of json2array UDF with just in-built function JSON_EXTRACT_ARRAY as in below example

#standardSQL
SELECT id,  
  (
    SELECT x
    FROM UNNEST(JSON_EXTRACT_ARRAY(json, '$')) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.t') = '2'
  ) extracted
FROM `project.dataset.table`  

==============

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
"""; 
SELECT id,  
  (
    SELECT x
    FROM UNNEST(json2array(JSON_EXTRACT(json, '$'))) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.t') = '2'
  ) extracted
FROM `project.dataset.table`  

You can test, play with above using dummy data as in below example

#standardSQL
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  return JSON.parse(json).map(x=>JSON.stringify(x));
"""; 
WITH `project.dataset.table` AS (
  SELECT 1 id, '[{"location":[22.99902,66.000],"t":1},{"location":[55.32168,140.556],"t":2},{"location":[85.0002,20.0055],"t":3}]' json UNION ALL
  SELECT 2, '[{"location":[22.99902,66.000],"t":11},{"location":[85.0002,20.0055],"t":13}]'
)
SELECT id,  
  (
    SELECT x
    FROM UNNEST(json2array(JSON_EXTRACT(json, '$'))) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.t') = '2'
  ) extracted
FROM `project.dataset.table`

with output

Row id  extracted    
1   1   {"location":[55.32168,140.556],"t":2}    
2   2   null     

Above assumes that there is no more than one element with "t":2 in json column. In case if there can be more than one - you should add ARRAY as below

SELECT id,  
  ARRAY(
    SELECT x
    FROM UNNEST(json2array(JSON_EXTRACT(json, '$'))) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.t') = '2'
  ) extracted
FROM `project.dataset.table`
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • That's a working solution, but as I said in my question, I am trying to move awat from js as it takes too much time, that's why I'm looking for a regexp – Jimmy G Feb 16 '20 at 15:51
0

There is another solution but it is not regexp based (as I had originally asked). So this should not count as the final answer to my own question, nonetheless could be useful.
It is based on a split of the string in array and then chosing the element in the array that satisfies my needs.

Steps:

  1. transform the string into something better for splits (using '|' as seperator):
    replace(replace(replace(my_field,'},{','}|{'),'[{','{'),'}]','}')
  2. split it using split(), which yields an array of strings (each one a json element)
  3. find the relevant element ("t":2) - in my case, the first one is good enough, so I limit the query to 1: array( select data from unnest(split(replace(replace(replace(my_field,'},{','}|{'),'[{','{'),'}]','}'),'|')) as data where data like '%"t":2%' limit 1)
  4. Convert that into a useable string with array_to_string() and use json_extract on that string to extract the relevant info from the element that I need (say for example, location coordinate x).

So putting it all together:
round(safe_cast(json_extract(array_to_string(array( select data from unnest(split(replace(replace(replace(my_field,'},{','}|{'),'[{','{'),'}]','}'),'|')) as data where data like '%"t":2%' limit 1),''),'$.location[0]') as float64),3) loc_x

Jimmy G
  • 23
  • 4
0

Even though, you have posted a work around your issue. I believe this answer will be informative. You mentioned that one of the answer selected more than what you needed, I wrote the query below to reproduce your case and achieve aimed output.

  WITH
  data AS (
  SELECT
    " [{ \"location\":[22.99902,66.000]\"t\":1},{\"location\":[55.32168,140.556],\"t\":2},{\"location\":[85.0002,20.0055],\"t\":3}] " AS string_j
  UNION ALL
  SELECT
    " [{ \"location\":[22.99902,66.000]\"t\":1},{\"location\":[55.32168,140.556],\"t\":3},{\"location\":[85.0002,20.0055],\"t\":3}] " AS string_j
  UNION ALL
  SELECT
    " [{ \"location\":[22.99902,66.000]\"t\":1},{\"location\":[55.32168,140.556],\"t\":3},{\"location\":[85.0002,20.0055],\"t\":3}] " AS string_j
  UNION ALL
  SELECT
    " [{ \"location\":[22.99902,66.000]\"t\":1},{\"location\":[55.32168,140.556],\"t\":3},{\"location\":[85.0002,20.0055],\"t\":3}] " AS string_j ),
  refined_data AS (
  SELECT
    REGEXP_EXTRACT(string_j, r"\{\"\w*\"\:\[\d*\.\d*\,\d*\.\d*\]\,\"t\"\:2\}") AS desired_field
  FROM
    data )
  SELECT
  *
  FROM
  refined_data
  WHERE
  desired_field IS NOT NULL

Notice that I have used the dummy described in the temp table, populated inside the WITH method. As below:

enter image description here

Afterwords, in the table refined_data, I used the REGEXP_EXTRACT to extract the desired string from the column. Observe that for the rows which there is not a match expression, the output is null. Thus, the table refined_data is as follows :

enter image description here

As you can see, now it is just needed a simple WHERE filter to obtain the desired output, which was done in the last select.

In addition you can see the information about the regex expression I provided here.

halfer
  • 19,824
  • 17
  • 99
  • 186
Alexandre Moraes
  • 3,892
  • 1
  • 6
  • 13