
I want to turn a JSON-encoded list into a native BigQuery array. Ideally the result would be a list of tuples or dictionaries with value and position entries, hence the reference to Python's enumerate functionality.

i.e.

[(idx, elem) for idx, elem in enumerate(json.loads(json_list_string))]

[{'pos': idx, 'value': elem} for idx, elem in enumerate(json.loads(json_list_string))]
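For reference, a runnable Python version of the enumerate idea (the sample JSON string is made up for illustration):

```python
import json

# A JSON-encoded list, as it would arrive in the ids column.
json_list_string = '["a1", "b2", "c3"]'

# Parse the JSON first; enumerating the raw string would yield characters.
pairs = [(idx, elem) for idx, elem in enumerate(json.loads(json_list_string))]
records = [{'pos': idx, 'value': elem}
           for idx, elem in enumerate(json.loads(json_list_string))]

print(pairs)    # [(0, 'a1'), (1, 'b2'), (2, 'c3')]
print(records)  # [{'pos': 0, 'value': 'a1'}, ...]
```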

I already solved the first part, turning the JSON into an array, using this question.

Edit:

WITH
  my_ids AS (
  SELECT 'xyz' as grp, '["7f9f98fh9g4ef393d3h5", "chg3g33f26949hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' as ids
  UNION ALL
  SELECT 'abc' as grp, '["7f9f98fh9g4ef393d3h5", "chg3g33fdsfsdfs49hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' as ids
  )

SELECT
  *
FROM my_ids

In an ideal world I would get an output like:

xyz, 7f9f98fh9g4ef393d3h5, 1
xyz, chg3g33f26949hg6067d, 2
...
abc, 946637g145h48322686f, 6

Please note that the lists can be rather long (up to 24 entries), and I'd rather not hardcode all the paths.

Edit2: (Possible Solution)

WITH
  my_ids AS (
  SELECT 'xyz' as grp, '["7f9f98fh9g4ef393d3h5", "chg3g33f26949hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' as ids
  UNION ALL
  SELECT 'abc' as grp, '["7f9f98fh9g4ef393d3h5", "chg3g33fdsfsdfs49hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' as ids
  ),

  as_list AS (SELECT
    *,
    SPLIT(REGEXP_REPLACE(JSON_EXTRACT(ids,'$'), r'[\[\]\"]', ''), ',') AS split_items,
    GENERATE_ARRAY(1, ARRAY_LENGTH(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(ids,'$'), r'[\[\]\"]', ''), ','))) AS positions
  FROM my_ids)

SELECT grp, ids, positions[OFFSET(off)] AS pos
FROM as_list, UNNEST(split_items) AS ids WITH OFFSET off
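Outside BigQuery, the split-and-position logic above can be sketched in Python (a rough analogue of the query, not its actual execution; the regex and names mirror the SQL, and the sample rows are shortened):

```python
import re

# Shortened stand-in for the my_ids CTE: (grp, JSON-encoded ids string).
rows = [
    ('xyz', '["7f9f98fh9g4ef393d3h5", "chg3g33f26949hg6067d"]'),
]

result = []
for grp, ids in rows:
    # Mirrors REGEXP_REPLACE(JSON_EXTRACT(ids,'$'), r'[\[\]\"]', '')
    # followed by SPLIT(..., ',').
    split_items = re.sub(r'[\[\]"]', '', ids).split(',')
    # WITH OFFSET yields a zero-based index; positions[OFFSET(off)]
    # maps it to the one-based GENERATE_ARRAY value, i.e. off + 1.
    for off, item in enumerate(split_items):
        # Note: the SQL SPLIT keeps the leading space after each comma
        # unless TRIM is applied; here we strip it for readability.
        result.append((grp, item.strip(), off + 1))

for row in result:
    print(row)
```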
Fabian Bosler

1 Answer


Below is for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.my_ids` AS (
  SELECT 'xyz' AS grp, '["7f9f98fh9g4ef393d3h5", "chg3g33f26949hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' AS ids UNION ALL
  SELECT 'abc' AS grp, '["7f9f98fh9g4ef393d3h5", "chg3g33fdsfsdfs49hg6067d", "g477e5973ec04g7c3232", "0de1ec83304d761he786", "3c1h1f153530g90g35c2", "946637g145h48322686f"]' AS ids
)
SELECT grp, id, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY OFFSET) pos
FROM `project.dataset.my_ids`, 
UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(ids,'$'), r'[\[\]\"]', ''), ',')) id WITH OFFSET
Mikhail Berlyant
  • indeed much cleaner, funnily enough I did have a version like this before moving to offset. I kept running into Resource Exceeded errors when applying this logic to ~60GB of data. There is another window function in my pipeline which is also causing Resource Exceeded problems right now, so I can't say for certain. Can you say anything about how the two approaches compare in terms of resource usage? – Fabian Bosler Jun 12 '19 at 13:41
  • Any idea how I can overcome the resource exceeded with the over function? I remember in the back of my head that there is something about double partitioning. – Fabian Bosler Jun 12 '19 at 14:00
  • It is not really possible to answer this in comments - I would suggest you post a new question with the specific use case and all relevant details so we can help you – Mikhail Berlyant Jun 12 '19 at 14:02
  • wouldn't the following be cleaner? I will shortly write the follow-up question. SELECT grp, id, pos + 1 AS pos FROM `project.dataset.my_ids`, UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(ids,'$'), r'[\[\]\"]', ''), ',')) id WITH OFFSET AS pos, or just accept that the indexing is zero-based; in essence the "WITH OFFSET" was what I was initially asking for!! – Fabian Bosler Jun 12 '19 at 14:48
  • this is the standalone question btw: https://stackoverflow.com/questions/56565523/query-failed-with-error-resource-exhausted-order-by-without-order-by – Fabian Bosler Jun 12 '19 at 15:17