24

I need to automate R to read a csv data file that is inside a zip file.

For example, I would type:

read.zip(file = "myfile.zip")

And internally, what would be done is:

  • Unzip myfile.zip to a temporary folder
  • Read the only file contained in it using read.csv

If there is more than one file in the zip file, an error is thrown.

My problem is getting the name of the file contained in the zip file, in order to pass it to the read.csv command. Does anyone know how to do it?

UPDATE

Here's the function I wrote based on @Paul's answer:

read.zip <- function(zipfile, row.names=NULL, dec=".") {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get the files into the dir
    files <- list.files(zipdir)
    # Throw an error if there's more than one
    if(length(files)>1) stop("More than one data file inside zip")
    # Get the full name of the file
    file <- file.path(zipdir, files[1])
    # Read the file, passing the options by name so they are not matched to header and sep
    read.csv(file, row.names=row.names, dec=dec)
}

Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!

Jack Wasey
João Daniel
  • possible duplicates? at: http://stackoverflow.com/questions/3053833/using-r-to-download-zipped-data-file-extract-and-import-data; http://stackoverflow.com/questions/7044808/using-r-to-download-gzipped-data-file-extract-and-import-data/7045059#7045059 – aatrujillob Jan 24 '12 at 12:57
  • Actually the first link is not related, since my problem wasn't unzipping the file but getting the names of the files inside the zip. But yes, the second shows the `list.files` command, which was (so far) unknown to me. – João Daniel Jan 24 '12 at 13:40
  • @jdanielnd: you can get to the file names in the zip file using `unzip(file, list=TRUE)`, as I used in my answer. – Joshua Ulrich Jan 25 '12 at 21:28
  • Does this answer your question? [Extract certain files from .zip](https://stackoverflow.com/questions/32870863/extract-certain-files-from-zip) – outis Oct 27 '21 at 10:09

9 Answers

11

Another solution using unz:

read.zip <- function(file, ...) {
  zipFileInfo <- unzip(file, list=TRUE)
  if(nrow(zipFileInfo) > 1)
    stop("More than one data file inside zip")
  else
    read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
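
A minimal sketch of the two calls this relies on, for trying them without an existing archive. It assumes an external `zip` program is on the system path (R's `zip()` shells out to it), and the names `demo.csv` and `demo.zip` are made up for illustration:

```r
# Build a throwaway zip archive holding a single csv (hypothetical file names)
workdir <- tempfile()
dir.create(workdir)
csvfile <- file.path(workdir, "demo.csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), csvfile, row.names = FALSE)
zipfile <- file.path(workdir, "demo.zip")
# -j stores the file without its directory path, -q keeps zip quiet;
# this step needs an external zip program
zip(zipfile, files = csvfile, flags = "-jq")

# List the archive contents without extracting anything
info <- unzip(zipfile, list = TRUE)
print(info$Name)

# Read the single member straight from the archive through a connection
dat <- read.csv(unz(zipfile, info$Name[1]))
print(dat)
```

Because `unz()` opens a connection into the archive, nothing is written to disk apart from the demo files themselves.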
Joshua Ulrich
10

You can use unzip to unzip the file. I mention this only because it is not clear from your question whether you knew that. As for reading the file: once you have extracted it to a temporary dir (see ?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:

l = list.files(temp_path, full.names = TRUE)
read.csv(l[1])

assuming your tempdir location is stored in temp_path.
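
One detail that is easy to miss: `list.files()` returns bare file names unless `full.names = TRUE` is given, and bare names only resolve if the working directory happens to be the extraction directory. A small sketch of the difference, using a temp dir with one csv as a stand-in for the extracted archive (the file name `only.csv` is hypothetical):

```r
# Stand-in for an extracted archive: a temp dir holding one csv
temp_path <- tempfile()
dir.create(temp_path)
write.csv(data.frame(x = 1:2), file.path(temp_path, "only.csv"), row.names = FALSE)

# Without full.names, list.files() returns names relative to temp_path
bare <- list.files(temp_path)

# With full.names = TRUE the result can be handed to read.csv()
# from any working directory
full <- list.files(temp_path, full.names = TRUE)

df <- read.csv(full[1])
```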

Paul Hiemstra
  • That's just what I was looking for! I was trying to use `system("ls")` but it didn't return an R object, like a vector. Thanks! – João Daniel Jan 24 '12 at 12:54
  • @JoãoDaniel `system("ls")` isn't the way to go here but `system("ls", intern = TRUE)` is probably what you were hoping for – Dason Sep 06 '13 at 20:28
4

I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:

read.csv.zip <- function(zipfile, ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir)
    files <- files[grep("\\.csv$", files)]
    # Create a named list of the imported csv files
    # (simplify=FALSE stops sapply from collapsing the data frames into a matrix)
    csv.data <- sapply(files, function(f) {
        fp <- file.path(zipdir, f)
        return(read.csv(fp, ...))
    }, simplify=FALSE)
    return(csv.data)
}
  • I had to use `recursive=TRUE` in `list.files()`. Also, instead of using `grep()` to subset in the second definition of `files`, you can simply make use of the `pattern` argument in `list.files`: `files <- list.files(zipdir, recursive=TRUE, pattern="\\.csv$")`. I also made a naming improvement to the returned list, `names(csv.data) <- gsub(".+\\/", "", files,perl=T)`. I might add these changes as a new answer, but feel free to update your approach. Thanks! – rbatt Aug 03 '15 at 18:15
  • @rbatt Great feedback. I was still new to R when I wrote that so I didn't know to look for options like `pattern` and `recursive`. I doubt I'll edit my answer but I'd enjoy seeing your code. Thanks! – Corned Beef Hash Map Aug 04 '15 at 18:24
2

If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:

zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))

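This solution also has the advantage that no temporary files are created. Note that GNU zcat can only decompress a zip archive that has a single member compressed with the deflate method.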
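
A related sketch for comma-separated data, under the assumption that the Info-ZIP `unzip` command-line tool is installed: its `-p` option streams the archive contents to standard output, which R can read through `pipe()`. This is not part of the answer above, and the demo file names are hypothetical; creating the demo archive also needs an external `zip` program.

```r
# Build a small demo archive (hypothetical file names)
workdir <- tempfile()
dir.create(workdir)
csvfile <- file.path(workdir, "demo.csv")
write.csv(data.frame(id = 1:3), csvfile, row.names = FALSE)
zipfile <- file.path(workdir, "demo.zip")
zip(zipfile, files = csvfile, flags = "-jq")

# unzip -p writes the member's contents to stdout;
# shQuote() guards against spaces in the path
streamed <- read.csv(pipe(paste("unzip -p", shQuote(zipfile))))
```

As with zcat, no temporary copy of the data is extracted, but the approach depends on what is installed on the machine.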

Holger Brandl
2

Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:

  • My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).

  • I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.

  • Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.

  • Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains both .csv and .txt files and you want both, for example). If there are only some types of files you want, you can specify the pattern to only use those, too.
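
The pattern-based filtering from the list above can be sketched in isolation. The directory layout and file names below are made up for illustration:

```r
# Temp dir with mixed file types, one of them in a subdirectory
d <- tempfile()
dir.create(file.path(d, "sub"), recursive = TRUE)
file.create(file.path(d, c("a.csv", "notes.txt", "sub/b.csv")))

# Two-step version: list everything, then filter with a regular expression
all_files <- list.files(d, recursive = TRUE)
via_grep <- all_files[grep("\\.csv$", all_files)]

# One-step version using the pattern argument
via_pattern <- list.files(d, recursive = TRUE, pattern = "\\.csv$")

# pattern is a regular expression, so "." matches every non-empty file name
everything <- list.files(d, recursive = TRUE, pattern = ".")
```

Both versions select the same csv files; the `pattern` form just avoids the intermediate vector.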

Here is the actual function:

library(data.table)  # provides fread()

read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){

    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()

    # Create the dir using that name
    dir.create(zipdir)

    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)

    # Get a list of csv files in the dir
    files <- list.files(zipdir, recursive=TRUE, pattern=pattern)

    # Create a list of the imported csv files
    # (simplify=FALSE stops sapply from collapsing the tables into a matrix)
    csv.data <- sapply(files, 
        function(f){
            fp <- file.path(zipdir, f)
            dat <- fread(fp, ...)
            return(dat)
        },
        simplify=FALSE
    )

    # Use csv names to name list elements
    names(csv.data) <- basename(files)

    # Return data
    return(csv.data)
}
rbatt
1

The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.

head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))

read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
  zipfile <- tempfile()
  download.file(url = url, destfile = zipfile, quiet = TRUE)
  zipdir <- tempfile()
  dir.create(zipdir)
  unzip(zipfile, exdir = zipdir) # files = NULL (the default), so extract all
  files <- list.files(zipdir)
  if (is.null(filename)) {
    if (length(files) == 1) {
      filename <- files
    } else {
      stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
    }
  } else { # filename specified
    stopifnot(length(filename) ==1)
    stopifnot(filename %in% files)
  }
  do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Jack Wasey
1

Another approach that uses fread from the data.table package

fread.zip <- function(zipfile, ...) {
  # Function reads data from a zipped csv file
  # Uses fread from the data.table package

  ## Create the temporary directory or flush CSVs if it exists already
  if (!file.exists(tempdir())) {dir.create(tempdir())
  } else {file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
  }

  ## Unzip the file into the dir
  unzip(zipfile, exdir=tempdir())

  ## Get path to file (pattern is a regular expression, not a glob)
  file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)

  ## Throw an error if there's more than one
  if(length(file)>1) stop("More than one data file inside zip")

  ## Read the file
  fread(file, 
     na.strings = c(""), # read empty strings as NA
     ...
  )
}

Based on the answer/update by @joão-daniel

altabq
1

unzipped file location

outDir<-"~/Documents/unzipFolder"

get all the zip files

zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)

unzip all your files

purrr::map(.x = zipF, .f = unzip, exdir = outDir)

Gucci148
0

I just wrote a function based on the top read.zip answer that may help...

read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
    # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r

    # check the files within zip
    unzfiles <- unzip(zipfile, list=TRUE)
    if (is.na(internalfile) || is.numeric(internalfile)) {
        internalfile <- unzfiles$Name[ifelse(is.na(internalfile),1,internalfile[1])]
    }
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    if (verbose) cat("Directory created:",zipdir,"\n")
    dir.create(zipdir)
    # Remove the temporary dir (and the extracted file in it) when the function exits
    on.exit({
        if (verbose) cat("Done!\nRemoving temporary files:",zipdir,"\n")
        unlink(zipdir, recursive=TRUE)
    })
    # Unzip the file into the dir
    if (verbose) cat("Unzipping file:",internalfile,"...")
    unzip(zipfile, files=internalfile, exdir=zipdir)
    if (verbose) cat("Done!\n")
    # Get the full name of the file
    file <- file.path(zipdir, internalfile)
    # Read the file
    if (verbose) cat("Reading File...")
    read.function(file, ...)
}