3

I have a table in bigquery where I have object and for each object I have some stringified json. In json, an example row look like below:

{
    "ObjectID": "1984931229",
    "indexed_abstract": "{\"IndexLength\":123,\"InvertedIndex\":{\"Twenty-seven\":[0],\"metastatic\":[1,45],\"breast\":[2],\"adenocarcinoma\":[3],\"patients,\":[4]}}" 
}

where inside the indexed_abstract we have an InvertedIndex which contains some keywords and how many times these keywords appeared in the ObjectID.

Now I want to access the stringified json by parsing the json using bigquery and for each ObjectID I want to create a nested field where I have the keyword, the corresponding array and the length of the corresponding array.

For example in this case the output would look like the following:

+------------+----------------+---------------+-------------------+
|  ObjectID  |  keyword.key   | keyword.count | keyword.positions |
+------------+----------------+---------------+-------------------+
| 1984931229 | Twenty-seven   |             1 | [0]               |
|            | metastatic     |             2 | [1,45]            |
|            | breast         |             1 | [2]               |
|            | adenocarcinoma |             1 | [3]               |
|            | patients       |             1 | [4]               |
+------------+----------------+---------------+-------------------+

I understand I could use JSON_EXTRACT function but I am not sure what would be my key inside the inverted index to access the keywords and the arrays corresponding to them.

Syed Arefinul Haque
  • 1,123
  • 2
  • 14
  • 38

1 Answers1

4

Below is for BigQuery Standard SQL

#standardSQL
SELECT ObjectID, 
  ARRAY(
    SELECT AS STRUCT 
      key, 
      ARRAY_LENGTH(SPLIT(value)) `count`, 
      value positions 
    FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
    UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
  ) keyword
FROM `project.dataset.table`

If to apply to sample data from your question - result is

Row ObjectID    keyword.key     keyword.count   keyword.positions    
1   1984931229  Twenty-seven    1               [0]  
                metastatic      2               [1,45]   
                breast          1               [2]  
                adenocarcinoma  1               [3]  
                patients        1               [4]  

Update on Op's comment - I was wondering if I wanted to make the positions an array (a repeated field), how would I do that?

Change needs to be done in just one line

  SPLIT(REGEXP_REPLACE(value, r'\[|]', '')) positions 
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Wow, I that is complicated! Thanks though. I was wondering if I wanted to make the positions an array (a repeated field), how would I do that? – Syed Arefinul Haque Jul 20 '19 at 16:19
  • btw, i read your comment again and your extra ask is pretty simple - so I will just update my answer - but i am on go right now - so it will be later in a day :o) – Mikhail Berlyant Jul 20 '19 at 16:24
  • 1
    actually it is quite simple and not much to type so I just added it to my answer now – Mikhail Berlyant Jul 20 '19 at 16:29
  • I also tried your answer from here just now, that also work too! https://stackoverflow.com/questions/46199823/how-to-convert-stringified-array-into-array-in-bigquery . Thanks! – Syed Arefinul Haque Jul 20 '19 at 16:34