The GitHub API doesn't provide pagination when querying the stats of a repository, so the number of contributors is effectively capped at 100. The workaround to get this figure is to scrape the HTML element that contains it.
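
To illustrate what I mean, a request like the following (just a sketch, the repo is an arbitrary example) never returns more than 100 contributors in a single call:

library(httr)

# Sketch: the contributors endpoint returns at most 100 entries per request
r <- GET("https://api.github.com/repos/chartjs/Chart.js/contributors",
         query = list(per_page = 100))
length(content(r))  # never more than 100 per request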

I have found that the code below scrapes the contributors count just fine if the repo's URL was previously loaded in a browser (Chrome or Firefox), but it fails and returns an empty <span></span> element when the HTML page wasn't loaded in a web browser first.

This looks very strange to me, so I added a User-Agent header to a getURL() call to simulate a page load from my browser, but it doesn't change anything and the problem is still there.
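
For the record, the User-Agent was set roughly like this (a sketch, not my exact code; the header string is just my Chrome one):

library(RCurl)

# Sketch: fetch the page with a browser-like User-Agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0 Safari/537.36"
page <- getURL("https://github.com/chartjs/www.chartjs.org", useragent = ua)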

To reproduce the problem, I scrape multiple repos, 14 in this example:

# Create a vector of repo URLs
library(httr)   # GET(), content()
library(rvest)  # read_html(), html_nodes(), html_text()

r <- GET("https://api.github.com/users/chartjs/repos")
c <- content(r)
urls <- unlist(lapply(c, "[[", "html_url"))

> urls
 [1] "https://github.com/chartjs/Chart.BarFunnel.js"        "https://github.com/chartjs/Chart.js"                 
 [3] "https://github.com/chartjs/Chart.LinearGauge.js"      "https://github.com/chartjs/Chart.smith.js"           
 [5] "https://github.com/chartjs/chartjs-chart-financial"   "https://github.com/chartjs/chartjs-color"            
 [7] "https://github.com/chartjs/chartjs-color-string"      "https://github.com/chartjs/chartjs-plugin-annotation"
 [9] "https://github.com/chartjs/chartjs-plugin-datalabels" "https://github.com/chartjs/chartjs-plugin-deferred"  
[11] "https://github.com/chartjs/chartjs-plugin-zoom"       "https://github.com/chartjs/chartjs.github.io"        
[13] "https://github.com/chartjs/gitbook-plugin-chartjs"    "https://github.com/chartjs/www.chartjs.org"

complete <- NULL

while( is.null(complete) ) {

  # Download HTML and select the span element
  html <- lapply(urls, read_html)
  nodes <- lapply(html, html_nodes, xpath="//*[@class='num text-emphasized']")

  # Convert to text and keep the fourth value (the contributors count)
  text <- lapply(nodes, html_text)
  element <- lapply(text, magrittr::extract, 4)

  # Unlist and convert as numeric
  e <- as.numeric(unlist(element))
  print(e)
  if(!any(is.na(e))) complete <- TRUE

}

This is what I see when I execute the while() loop. The NAs correspond to the repositories for which the server fails to reply with the data.

The first three iterations fail for 4 out of the 14 repos, but as soon as I load the URL https://github.com/chartjs/www.chartjs.org (the last repo) in a web browser, the GitHub server replies with the data at the 4th iteration.

Finally, right before iteration 6 I loaded another repo's page in the browser: urls[13].

 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2  NA  NA
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2  NA  NA
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2  NA  NA
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2  NA   2 # urls[14] repo page loaded
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2  NA   2
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2   1   2 # urls[13] repo page loaded
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2   1   2
 [1]   3 231   3  NA   1  23  13  11   2  NA  11   2   1   2

Is it only me, or can some of you replicate this issue? Is there a solution to this problem?

Thanks,

Florent
  • You don't want to webscrape this, you want to use the proper REST API that GitHub provides. You'll get proper JSON data. – Dirk Eddelbuettel Oct 30 '17 at 20:39
  • The GitHub API actually caps the contributors count at 100: https://stackoverflow.com/questions/18148490/how-can-i-get-more-than-100-results-from-github-api-v3-using-github-api-gem – Florent Oct 30 '17 at 20:58
  • For some reason the code in my above example works right now but it fails with another repo, for example "https://github.com/leishman/kraken_ruby". – Florent Oct 30 '17 at 21:03
  • I investigated the issue and found that the GitHub server sends the data only if I load the page in my Chrome browser first. This is very strange. So I added a Chrome browser User-Agent to the `getURL` call when downloading multiple repos, but nothing changes. – Florent Oct 30 '17 at 23:55
  • I doubt this will stop you but https://github.com/robots.txt clearly indicates that they are discouraging scraping. https://help.github.com/articles/github-terms-of-service/ (specifically C 5) affords some use, but you've not stated your goal here. I also realize a ton of folks who answer q's on SO don't really care abt this, but for folks who do care abt ethics (the "can/should" that balances the "how") it'd be useful to ensure you're not dragging them into something that violates GH policies. – hrbrmstr Oct 31 '17 at 07:53
  • My goal is not to scrape GitHub to sell users' personal information, but to measure and assess the quality of some projects. I dropped an email to their support. – Florent Oct 31 '17 at 09:09

1 Answer

library(rvest)

# Scrape the repo page and keep the fourth "num text-emphasized" span (contributors)
source <- read_html("https://github.com/NemProject/vanitygen-cpp")
nb_results <- source %>%
  html_nodes(xpath="//*[@class='num text-emphasized']") %>%
  html_text() %>%
  magrittr::extract(4)

# Strip newlines and spaces before converting to numeric
nb_results <- as.numeric(gsub("\n| ", "", nb_results))

To get all four values ('commits', 'branch', 'release', 'contributors') at once:

source <- read_html("https://github.com/leishman/kraken_ruby")
results <- source %>%
  html_nodes(xpath="//*[@class='num text-emphasized']") %>%
  html_text()

results <- gsub("\n| ", "", results)
names(results) <- c('commits', 'branch', 'release', 'contributors')
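
If you need this for several repositories at once, here is a rough, untested sketch along the same lines (it reuses the urls vector from your question and assumes each repo page exposes exactly four of those span elements):

# Sketch: wrap the scraping steps in a helper and apply it to each URL
get_counts <- function(url) {
  counts <- read_html(url) %>%
    html_nodes(xpath="//*[@class='num text-emphasized']") %>%
    html_text()
  counts <- gsub("\n| ", "", counts)
  names(counts) <- c('commits', 'branch', 'release', 'contributors')
  counts
}

all_counts <- lapply(urls, get_counts)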
  • I edited the question with a reproducible example that uses multiple repos, could you please try your solution? – Florent Oct 30 '17 at 21:35