The GitHub API doesn't provide pagination when requesting a repository's contributor statistics, so the number of contributors returned is effectively capped at 100. The workaround to get this information is to scrape the HTML element that contains the data.
I have found that the code below scrapes the contributors count fine if the repo's URL was previously loaded in a browser (Chrome or Firefox), but it fails and returns an empty <span></span>
element when the HTML page wasn't loaded first in a web browser.
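For a single repository, the scrape itself boils down to something like this (a minimal sketch; the contributors count is the fourth "num text-emphasized" counter on the page, at least in the current markup):
library(rvest)

# Read the repo page and pull the fourth counter span (contributors)
page   <- read_html("https://github.com/chartjs/Chart.js")
counts <- html_text(html_nodes(page, xpath = "//*[@class='num text-emphasized']"))
as.numeric(counts[4])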
This looks very strange to me, so I added a User-Agent header in a getURL()
call to simulate a page load from my browser, but it doesn't change anything and the problem is still there.
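Something along these lines (a sketch; RCurl is assumed here since getURL() comes from it, and the User-Agent string is just an example):
library(RCurl)
library(rvest)

# Fetch the raw HTML with a browser-like User-Agent, then parse it as before
html_raw <- getURL("https://github.com/chartjs/Chart.js",
                   useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
                   followlocation = TRUE)
page <- read_html(html_raw)
html_text(html_nodes(page, xpath = "//*[@class='num text-emphasized']"))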
To reproduce the problem, I scrape multiple repos, 14 in this example:
library(httr)
library(rvest)

# Create a vector of the repos' URLs
r <- GET("https://api.github.com/users/chartjs/repos")
c <- content(r)
urls <- unlist(lapply(c, "[[", "html_url"))
> urls
[1] "https://github.com/chartjs/Chart.BarFunnel.js" "https://github.com/chartjs/Chart.js"
[3] "https://github.com/chartjs/Chart.LinearGauge.js" "https://github.com/chartjs/Chart.smith.js"
[5] "https://github.com/chartjs/chartjs-chart-financial" "https://github.com/chartjs/chartjs-color"
[7] "https://github.com/chartjs/chartjs-color-string" "https://github.com/chartjs/chartjs-plugin-annotation"
[9] "https://github.com/chartjs/chartjs-plugin-datalabels" "https://github.com/chartjs/chartjs-plugin-deferred"
[11] "https://github.com/chartjs/chartjs-plugin-zoom" "https://github.com/chartjs/chartjs.github.io"
[13] "https://github.com/chartjs/gitbook-plugin-chartjs" "https://github.com/chartjs/www.chartjs.org"
complete <- NULL
while (is.null(complete)) {
  # Download each repo's HTML and select the counter spans
  html  <- lapply(urls, read_html)
  nodes <- lapply(html, html_nodes, xpath = "//*[@class='num text-emphasized']")
  # Convert to text and keep the fourth counter (the contributors count)
  text    <- lapply(nodes, html_text)
  element <- lapply(text, magrittr::extract, 4)
  # Unlist and convert to numeric
  e <- as.numeric(unlist(element))
  print(e)
  # Stop once every repo has returned a value
  if (!any(is.na(e))) complete <- TRUE
}
This is what I see when I execute the while() loop. The NAs correspond to the repositories for which the server fails to reply with the data.
The first three iterations fail for 4 out of the 14 repos, but as soon as I load the URL https://github.com/chartjs/www.chartjs.org
(the last repo) in a web browser, the GitHub server replies with the data at the 4th iteration.
Finally, right before iteration n°6, I loaded another repo's page in the browser: urls[13].
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 NA NA
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 NA NA
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 NA NA
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 NA 2 # urls[14] repo page loaded
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 NA 2
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 1 2 # urls[13] repo page loaded
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 1 2
[1] 3 231 3 NA 1 23 13 11 2 NA 11 2 1 2
Is it only me, or can any of you replicate this issue? Is there a solution to this problem?
Thanks,