0

I have the URL of a file I want to download and from here directly read into R. It is an .xlsx file from Gapminder. The Webiste to find it is "https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx".

I have tried two things:

url <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
download.file(url,destfile="example.xlsx")
example <- read_excel("example.xlsx")

with the error

> url <- "https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx"
> download.file(url,destfile="example.xlsx")
trying URL 'https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx'
Content type 'text/html; charset=utf-8' length unknown
downloaded 61 KB

> example <- read_excel("example.xlsx")
Error: Evaluation error: zip file 'C:\Users\user\Documents\example.xlsx' cannot be opened.
> 

and

library(RCurl)
URL <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
x <- getURL(URL)
out <- read.csv(textConnection(x))
head(out[1:6])

with the result

> library(RCurl)
> URL <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
> x <- getURL(URL)
> out <- read.csv(textConnection(x))
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
> head(out[1:6])
Error in `[.data.frame`(out, 1:6) : undefined columns selected

So the file is either not downloaded at all or incorrectly read in. How can I simply download that one excel-file from the webpage and read in?

Ben Mat
  • 31
  • 1
  • 1
  • 8

2 Answers2

4

Despite the URL, the GitHub server doesn't actually serve the file unless raw=true is passed in as a parameter.

Use this URL instead: https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx?raw=true

James
  • 65,548
  • 14
  • 155
  • 193
  • thank you. it works for downloading the file indeed. however, it downloads the first sheet of the excel file only. in consequence, I can´t read the data into R. What would I have to add to the URL you provided to access and download sheet two of the file that is named "countries_and_territories" ? – Ben Mat Jan 08 '20 at 12:26
  • @BenMat The whole file gets downloaded. `read_excel` can only read one sheet at a time. You can specify which sheet with the `sheet` parameter. – James Jan 08 '20 at 13:26
3

I don't think just changing the url as @James says is enough, (note that the url in question forwards your browser to a different url with the actual file - in your case it's https://raw.githubusercontent.com/Gapminder-Indicators/u5mr/master/u5mr-by-gapminder.xlsx).

But this is not the only problem. You need to write the downloaded file in binary format using mode = "wb" before you can open it with read_excel, also specifying the worksheet that you want to open.

Here's a working example:

library(readxl)

destfile <- path.expand("~/example.xlsx")
url      <- paste0("https://raw.githubusercontent.com/Gapminder-Indicators",
                    "/u5mr/master/u5mr-by-gapminder.xlsx")

download.file(url, destfile = destfile, mode = "wb")
example <- read_excel(destfile, sheet = 2)

and now you have the spreadsheet as a tibble:

print(example)
#> # A tibble: 275 x 305
#>    geo.name indicator.name geo   indicator `1800` `1801` `1802` `1803` `1804` `1805`
#>    <chr>    <chr>          <chr> <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Abkhazia Child mortali~ abkh  u5mr         NA     NA     NA     NA     NA     NA 
#>  2 Afghani~ Child mortali~ afg   u5mr        469.   469.   469.   469.   469.   469.
#>  3 Akrotir~ Child mortali~ akr_~ u5mr         NA     NA     NA     NA     NA     NA 
#>  4 Albania  Child mortali~ alb   u5mr        375.   375.   375.   375.   375.   375.
#>  5 Algeria  Child mortali~ dza   u5mr        460.   460.   460.   460.   460.   460.
#>  6 America~ Child mortali~ asm   u5mr         NA     NA     NA     NA     NA     NA 
#>  7 Andorra  Child mortali~ and   u5mr         NA     NA     NA     NA     NA     NA 
#>  8 Angola   Child mortali~ ago   u5mr        486.   486.   486.   486.   486.   486.
#>  9 Anguilla Child mortali~ aia   u5mr         NA     NA     NA     NA     NA     NA 
#> 10 Antigua~ Child mortali~ atg   u5mr        474.   470.   466.   462.   458.   455.
#> # ... with 265 more rows, and 295 more variables: `1806` <dbl>, `1807` <dbl>
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • 1
    This works nicely, with the caveat that you need `sheet = 2` or `sheet = 'countries_and_territories'` inside the call to `read_excel()`. – ulfelder Jan 08 '20 at 12:31