4

I am exploring the power of Google Biguery with the GDELT database using this tutorial however the sql dialect is in 'legacy' and I would like to use the standard dialect.

In legacy dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300

and when I try to translate into standard dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
    FROM
      `gdelt-bq.gdeltv2.gkg`
    WHERE
      DATE>20150302000000
      AND DATE < 20150304000000
      AND V2Persons LIKE '%Netanyahu%' )
  GROUP BY
    theme
  ORDER BY
    2 DESC
  LIMIT
    300

it throws the following error:

No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]

it seems like I have to cast the result of the SPLIT() operation as a string. How do I do this?

UPDATE: I found a talk explaining the unnest operation:

SELECT
  COUNT(*),
  REGEXP_REPLACE(themes,",.*","") AS theme
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`,
  UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
  _PARTITIONTIME >= "2018-08-09 00:00:00"
  AND _PARTITIONTIME < "2018-08-10 00:00:00"
  AND V2Persons LIKE '%Netanyahu%'
GROUP BY
  theme
ORDER BY
  2 DESC
LIMIT
  100
blong
  • 2,815
  • 8
  • 44
  • 110
Rutger Hofste
  • 4,073
  • 3
  • 33
  • 44

1 Answers1

5

Flatten the array first:

SELECT
  REGEXP_REPLACE(theme , r',.*', " ") AS theme,
  COUNT(*) AS count
FROM
  `gdelt-bq.gdeltv2.gkg`,
  UNNEST(SPLIT(V2Themes,';')) AS theme
WHERE
  DATE>20150302000000
  AND DATE < 20150304000000
  AND V2Persons LIKE '%Netanyahu%' 
GROUP BY
  theme
ORDER BY
  2 DESC
LIMIT
  300

The legacy SQL equivalent in your question actually has the effect of flattening the array as well, although it's implicit in the GROUP BY on the theme.

Martin Weitzmann
  • 4,430
  • 10
  • 19
Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99
  • the code does not work "as is". I just copied it in the query editor and get the error: "Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING at [6:10]" – Rutger Hofste Aug 09 '18 at 14:27
  • 1
    I edited it and moved the split, which creates the array, to the cross join – Martin Weitzmann Aug 09 '18 at 14:35
  • 1
    Ah, right, I didn't read closely enough. I was focused on the `V2Themes`, which I was thinking was an array. Thanks for the edit! – Elliott Brossard Aug 09 '18 at 14:54