155

@EZGraphs on Twitter writes: "Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"

I was also trying to do this today, but ended up just downloading the zip file manually.

I tried something like:

fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")

but I feel as if I'm a long way off. Any thoughts?

zx8754
Jeromy Anglim

10 Answers

211

Zip archives are actually more of a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to

  1. Create a temp. file name (e.g. tempfile())
  2. Use download.file() to fetch the file into the temp. file
  3. Use unz() to extract the target file from temp. file
  4. Remove the temp file via unlink()

which in code (thanks for basic example, but this is simpler) looks like

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
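
The four steps above can also be wrapped in a small helper so that the temporary file is removed even if the download or the read fails. This is only a sketch: `read_zipped_table` is a hypothetical name, not a base R function, and it assumes the archive member is plain text that read.table() can parse.

```r
# Hypothetical helper (not part of base R): download a zip archive,
# read one member with read.table(), and always clean up the temp file.
read_zipped_table <- function(url, member, ...) {
  tmp <- tempfile(fileext = ".zip")
  # Remove the downloaded archive when the function exits, even on error
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the binary archive intact on Windows
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  # read.table() opens the unopened unz() connection and closes it when done
  utils::read.table(unz(tmp, member), ...)
}
```

With the URL from the question this would be called as `read_zipped_table("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat")`; extra arguments such as `header = TRUE` are passed on to read.table().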

Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
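
To illustrate the difference, a gzipped file can be handed to a reader through gzfile() with no extraction step. The sketch below writes a small made-up data frame to a temporary .csv.gz file and reads it back; the data and the file name are invented for the example.

```r
# Write a small, made-up data frame to a gzip-compressed CSV
path <- tempfile(fileext = ".csv.gz")
out <- gzfile(path, "w")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), out, row.names = FALSE)
close(out)

# Read it straight back through a gzfile() connection; no unzip step needed
df <- read.csv(gzfile(path))

# Clean up the temporary file
unlink(path)
```

The same gzfile() call should work on any local .gz file that holds a single delimited text file.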

Dirk Eddelbuettel
  • Dirk, would you mind expanding on how to extract data from a `.z` archive? I can read from a url connection with `readBin(url(x, "rb"), 'raw', 99999999)`, but how would I extract the contained data? The `uncompress` package has been removed from CRAN - is this possible in base R (and if so, is it restricted to *nix systems?)? Happy to post as a new question if appropriate. – jbaums Apr 25 '13 at 03:36
  • 4
    See `help(gzfile)` -- I was thinking that the gzip protocol may now uncompress (stone old) .z files too now that the patent has long expired. It may not. Who uses .z anyways? The 1980s called, they want their compression back ;-) – Dirk Eddelbuettel Apr 25 '13 at 03:41
  • Thanks - I can't get it to work, so perhaps it's unsupported after all. The Australian Bureau of Meteorology provides some of their data as .z, unfortunately! – jbaums Apr 25 '13 at 04:16
  • FYI It does not work with `readRDS()` (at least for me). From what I can tell, the file needs to be in a kind of file that you can read with `read.table()`. – jessi Aug 23 '14 at 20:14
  • To echo the comment by @jessi , it doesn't work with readxl::read_excel() either – Brent Brewington Feb 05 '17 at 17:42
  • 2
    you'll also want to close the connection. R can only have 125 open at once. Something like con <- unz(temp, "a1.dat"); data <- read.table(con); close(con); – pdb Jun 22 '17 at 15:42
  • 1
    Using library(archive) one can also do read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols()) which I find more convenient (it also supports all major archive formats & is faster than untar or unz I believe). To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX). So for me that would be the preferred option. – Tom Wenseleers Jul 11 '22 at 15:42
  • 1
    Sure. Add-on packages are great and often offer complementary functionality _today_. But my answer was written _twelve years ago_ and it is nice that it already worked then, or eleven years before `archive` appeared. And likely will work for years to come whereas add-on packages sometimes disappear, or change. – Dirk Eddelbuettel Jul 11 '22 at 16:55
34

Just for the record, I tried translating Dirk's answer into code :-P

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
gd047
23

I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.

library(downloader)
# `url` holds the address of the zip file to fetch
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")
sebastian-c
unixcreeper
13

For Mac (and I assume Linux)...

If the zip archive contains a single file, you can use the command-line utility funzip, in conjunction with fread from the data.table package:

library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")

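In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout (this needs a tar that can read zip archives, such as the bsdtar that ships with macOS; GNU tar cannot):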

dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
dnlbrky
  • when I tried your solution for multiple files, I'm getting an error that `File is empty:` – bshelt141 Aug 16 '17 at 15:06
  • Did not work for me either. The only one that worked was `read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())` below. It is also the only way to read a .zip directly from shinyapps.io – IVIM Jan 16 '23 at 00:13
11

Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.

library(readxl)

url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"

temp <- tempfile()
temp2 <- tempfile()

download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))

# temp2 is a directory, so recursive = TRUE is needed to remove it
unlink(c(temp, temp2), recursive = TRUE)
ColinTea
7

Using library(archive) one can also read a particular csv file within the archive without having to unzip it first: read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols()), where read_csv() and cols() come from the readr package. I find this more convenient, and it is faster.

It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.

To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)

This works on all platforms & given the superior performance for me would be the preferred option.

Tom Wenseleers
  • 1
    This is great answer ! - It allows me to read a remote .zip file from shinyapp, which none of other answers can do. Also, a tip: You do need to use `readr::read_csv(...)` here, and with `readr::cols()` . I tried `data.table::fread(...)` and it did not work. – IVIM Feb 05 '23 at 01:52
6

To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.

library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
unlink(temp)  # delete the downloaded zip file; rm(temp) would only remove the variable

I know this is possible in a single line since you can pass shell commands to fread, but I am not sure how to download a .zip file, extract it, and pass a single file from it to fread.
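
One possible way to get closer to a single call is to let fread() read the output of `unzip -p`, which prints one archive member to stdout. This is a sketch under assumptions: `fread_zipped` is a hypothetical name, the Info-ZIP `unzip` program must be on the PATH, and the installed data.table must be recent enough to have the `cmd` argument.

```r
# Hypothetical helper: download a zip archive and stream one member into
# data.table::fread() via the command-line `unzip -p` (Info-ZIP).
fread_zipped <- function(url, member, ...) {
  tmp <- tempfile(fileext = ".zip")
  # Remove the downloaded archive when the function exits, even on error
  on.exit(unlink(tmp), add = TRUE)
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  # shQuote() protects paths and member names that contain spaces
  cmd <- paste("unzip -p", shQuote(tmp), shQuote(member))
  data.table::fread(cmd = cmd, ...)
}
```

For the data set above the call would be `fread_zipped("https://www.bls.gov/tus/special.requests/atusact_0315.zip", "atusact_0315.dat")`. Unlike the unzip() version, nothing is extracted into the working directory.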

C8H10N4O2
Mallick Hossain
4

Try this code. It works for me:

unzip(zipfile="<directory and filename>",
      exdir="<directory where the content will be extracted>")

Example:

unzip(zipfile="./data/Data.zip",exdir="./data")
Peter Badida
1

The rio package would be very suitable for this: its import() function uses the extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.

library(rio)

# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")

# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)

# list zip archive
file_names <- unzip(tf, list=TRUE)

# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)

# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))

# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))

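# delete the downloaded archive and the extracted files
# (td is the session's temporary directory, so it is left in place)
unlink(c(tf, file.path(td, file_names$Name)))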
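
When only some members are wanted, the listing returned by unzip(list = TRUE) can be filtered before anything is extracted. A minimal sketch in base R; `extract_matching` is a hypothetical helper name, not part of rio or utils.

```r
# Hypothetical helper: extract only the archive members whose names match
# a regular expression, into a fresh directory, and return their paths.
extract_matching <- function(zipfile, pattern, exdir = tempfile("unzipped_")) {
  # list = TRUE returns a data frame with a Name column and extracts nothing
  listing <- utils::unzip(zipfile, list = TRUE)
  wanted <- grep(pattern, listing$Name, value = TRUE)
  if (length(wanted) == 0L) {
    return(character(0))
  }
  # unzip() creates exdir if needed and returns the extracted file paths
  utils::unzip(zipfile, files = wanted, exdir = exdir)
}
```

For example, `csv_paths <- extract_matching(tf, "\\.csv$")` would extract only the csv files, and `lapply(csv_paths, import)` would then read them.
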
camnesia
0

I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:

zip.url <- "url_address.zip"

dir <- getwd()

zip.file <- "file_name.zip"

zip.combine <- file.path(dir, zip.file)

download.file(zip.url, destfile = zip.combine)

unzip(zip.combine)
Coder-256