The ghtorrent-bq data is a great snapshot of GitHub; however, it is not clear when it is updated or how I could get more up-to-date data.

Steren

2 Answers

Theoretically, it is updated every time a new GHTorrent MySQL dump is released. In practice, the generated CSVs still need manual adjustments, because fields such as user locations contain lots of weird text that CSV parsers fail to handle.

http://ghtorrent.org/gcloud.html
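
To see when a given snapshot was last loaded, one option is to query the dataset's `__TABLES__` metadata view in BigQuery. The following is a minimal sketch, assuming the `ghtorrent-bq.ght_2017_01_19` dataset named elsewhere on this page is still readable from your project; it reports when each table was last modified, not when the next dump will arrive.

```sql
#standardSQL
-- List each table in the snapshot dataset with its last-modified time.
-- Assumption: the dataset below exists and its metadata is readable.
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  row_count
FROM `ghtorrent-bq.ght_2017_01_19.__TABLES__`
ORDER BY last_modified DESC
```

Swapping in another dataset name gives the same information for a different snapshot.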

Georgios Gousios

(related to https://stackoverflow.com/a/42930963/132438)

GHTorrent only provides a periodic snapshot of its data on BigQuery, while GitHub Archive updates daily (or even hourly - let me check that).

It would be great to have a more frequent snapshot of GHTorrent (maybe https://twitter.com/gousiosg can help), but in the meantime you can merge both datasets: start from the GHTorrent snapshot data, then add the latest stars from GitHub Archive:

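#standardSQL
-- Count distinct users who starred angular/angular, combining the
-- GHTorrent snapshot with the 2017 events from GitHub Archive.
SELECT COUNT(DISTINCT login) c
FROM (
  SELECT login
  FROM (
    -- Stargazers recorded in the GHTorrent snapshot
    SELECT login
    FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
    JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
    ON a.repo_id=b.id
    JOIN `ghtorrent-bq.ght_2017_01_19.users` c
    ON a.user_id=c.id
    WHERE url = 'https://api.github.com/repos/angular/angular'
  )
  UNION ALL (
    -- Stars (WatchEvent) recorded by GitHub Archive in 2017
    SELECT actor.login
    FROM `githubarchive.month.2017*`
    WHERE repo.name = 'angular/angular'
    AND type = 'WatchEvent'
  )
)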
Felipe Hoffa