I'm wanting to use GDELT to obtain a CSV containing a count of the number of articles containing a specific theme for all countries for a given number of years. Please note it's not quite the same as the events database, I'm specifically interested in the themes). There are some tutorials on the GDELT website (here), but they appear to be a little out of date regarding the regexp syntax and I'm not super familiar with SQL. Ideally, my output should look something like the following :
Year Country Count
2001 Afghanistan 34234
2002 Afghanistan 11864
...
2001 Zambia 939
2002 Zambia 864
From my understanding and the tutorial, this code counts the number of articles by theme and day of the year rather than year (taken from here). It's close to what I want, but not quite there.
select date(_partitiontime) date, count(theme) occurences
from `gdelt-bq.gdeltv2.gkg_partitioned`, unnest(split(themes,';')) as theme
where _partitiontime >= "2020-11-01 00:00:00" and _partitiontime < "2020-11-07 00:00:00"
and lower(theme) like "%bitcoin%"
group by date
-- order by date
I think I need to
- Add something to return and group by V2Locations at the country level
- Parse the data to get only the year
but am not sure how to do it. Any help would be greatly appreciated. Thanks!