2

My goal is to track over time the popularity of my BigQuery repo.

I want to use publicly available BigQuery datasets, like GitHub Archive or the GitHub dataset

The GitHub dataset sample_repos does not contain a snapshot of the star counts:

SELECT
  watch_count
FROM
  [bigquery-public-data:github_repos.sample_repos]
WHERE
  repo_name == 'angular/angular'

returns 5318.

GitHub Archive is a timeline of event. I can try to sum them all, but the numbers do not match with the numbers in the GitHub UI. I guess because it does not count unstar actions. Here is the query I used:

SELECT
  COUNT(*)
FROM
  [githubarchive:year.2011],
  [githubarchive:year.2012],
  [githubarchive:year.2013],
  [githubarchive:year.2014],
  [githubarchive:year.2015],
  [githubarchive:year.2016],
  TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2017-01-01'), TIMESTAMP('2017-03-30') )
WHERE
  repo.name == 'angular/angular'
  AND type = "WatchEvent"

returns 24144

The real value is 21,921

Steren
  • 7,311
  • 3
  • 31
  • 51

1 Answers1

6
#standardSQL
SELECT 
  COUNT(*) naive_count,
  COUNT(DISTINCT actor.id) unique_by_actor_id, 
  COUNT(DISTINCT actor.login) unique_by_actor_login 
FROM `githubarchive.month.*` 
WHERE repo.name = 'angular/angular'
AND type = "WatchEvent"

enter image description here

Naive count: Some people star and un-star, and star again. This creates duplicate WatchEvents.

Unique by actor id count: Each person can only star once. We can count those (but we don't know if they un-starred, so the total count will be lower than this).

Unique by actor login: Some historical months are missing the 'actor.id' field. We can look at the 'actor.login' field instead (but some people change their logins).

Alternatively, thanks to GHTorrent project:

#standardSQL
SELECT COUNT(*) stars
FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
ON a.repo_id=b.id
WHERE url = 'https://api.github.com/repos/angular/angular'
LIMIT 10

20567, as of 2017/01/19.


Related:

  • What happens when a project changes it's name?

https://stackoverflow.com/a/42935592/132438

  • How to get updated GHtorrent data, before they update it?

https://stackoverflow.com/a/42935662/132438

Community
  • 1
  • 1
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • The githubarchive approach seems to be a good approximate for `angular/angular`, but when I try on `bazelbuild/bazel`, I actually get 3320 (against 6033 in reality). – Steren Mar 21 '17 at 15:33
  • The `ghtorrent-bq` approach is much closer. But I want latest data, is the dataset updated daily? weekly? – Steren Mar 21 '17 at 15:40
  • Interesting comments, I added 2 additional answers (consider submitting new questions to stack overflow too, as an option to making this a too long dialog) – Felipe Hoffa Mar 21 '17 at 18:26
  • 1
    Subsequent questions asked here: http://stackoverflow.com/questions/42935551/bigquery-github-data-how-to-handle-repo-name-changes and here: http://stackoverflow.com/questions/42935607/bigquery-when-is-ghtorrent-refreshed-and-how-to-get-up-to-date-information – Steren Mar 21 '17 at 18:44