Consider a tar.gz file of a directory containing a lot of individual files.

From within R I can easily extract the name of the individual files with this command:

fileList <- untar("my_tar_dir.tar.gz", list = TRUE)

Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to disk)?

  • Have you seen [unzip a tar.gz file in R?](https://stackoverflow.com/a/7151322/4752675) The accepted answer seems to address extracting only one file. – G5W Jan 04 '19 at 15:22
  • Ahh yes, I can see that I was not specific enough - I do not want to unpack anything but read them in directly. Updated the question accordingly. – Kristoffer Vitting-Seerup Jan 04 '19 at 16:30
  • Added a solution below using library(archive) - that one should work & is a lot more elegant than the currently accepted answer... – Tom Wenseleers Jul 11 '22 at 16:22

2 Answers

It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases (e.g. file names, with their full path inside the archive, should be less than 100 characters). In a way it just re-implements "untar" in an extremely crude way, but such that it points directly to the desired file inside the gzipped archive.

The first problem is that a gzipped file can only be read sequentially from the start: using seek() to re-position the file pointer is, unfortunately, erratic on a gzipped connection.
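
To make this concrete, here is a minimal sketch (not part of the original answer) of skipping forward in a gz connection by reading and discarding bytes; skip_bytes is a hypothetical helper name:

skip_bytes <- function(con, n) {
  # read and discard n bytes in bounded chunks instead of seek();
  # chunked reads keep memory use flat even for large skips
  while (n > 0) {
    chunk <- readBin(con, what = "raw", n = min(n, 65536L))
    if (length(chunk) == 0L) break   # premature end of stream
    n <- n - length(chunk)
  }
  invisible(n == 0L)                 # TRUE if the full skip succeeded
}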

ParseTGZ <- function(archname) {
  # open the tgz archive
  tf <- gzfile(archname, open = 'rb')
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to the beginning of the entry header
    # never use seek() to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what = "raw", n = offset - seek(tf))
    # read the file name (a 100-byte, NUL-padded field)
    nameField <- readBin(tf, what = "raw", n = 100)
    # drop the NUL padding before converting: rawToChar() errors on embedded nuls
    fName <- rawToChar(nameField[cumsum(nameField == as.raw(0)) == 0])
    if (nchar(fName) == 0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    attr(fnames[[nfile]], "offset") <- offset + 512
    # read the size; first skip 24 bytes (file mode, uid, gid)
    # again, we only use readBin(), not seek()
    readBin(tf, what = "raw", n = 24)
    # the file size is encoded as a length-12 octal string,
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars = 11)
    # convert the octal string to a number of bytes
    sz <- strtoi(sz, base = 8L)
    attr(fnames[[nfile]], "size") <- sz
    # cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # go to the next entry;
    # don't forget the entry header itself (= 512 bytes)
    offset <- offset + 512 * (ceiling(sz / 512) + 1)
  }
  # return a named list of character strings with attributes
  names(fnames) <- fnames
  return(fnames)
}
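
For example, to inspect the result (using the archive name from the question):

fp <- ParseTGZ("my_tar_dir.tar.gz")
names(fp)            # file names inside the archive
attributes(fp[[1]])  # the $offset and $size of the first entry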

This will give you the exact position and length of every file in the tar.gz archive. The next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your file fits into memory.

extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open="rb")
  on.exit(close(gzf))
  # jump to the byte position, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what="raw", n=offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what="raw", n=fsize)
  result
}

Now, finally:

ff <- rawConnection(extractTGZ("myarchive", "myfile"))

Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
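
For example, if the extracted file happens to be a csv (an assumption purely for illustration):

dat <- read.csv(ff)   # read directly from the in-memory connection
close(ff)             # close the rawConnection when done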

Alex Deckmyn
  • Really nice solution - Thanks! Does it need an additional on.exit(close(gzf)) in the extractTGZ function? – Kristoffer Vitting-Seerup Jan 07 '19 at 13:02
  • Indeed! I added this in the answer. – Alex Deckmyn Jan 08 '19 at 09:14
  • With regards to the comment "# may be a bad idea on really large archives..." would you do it differently if you had thousands of (small) files within the tar.gz archive? – Kristoffer Vitting-Seerup Jan 09 '19 at 10:44
  • It's not the number of files, just the total size. If your (unzipped) archive is really large (say >1GB), then readBin() will, temporarily, create a very large vector. In that case it may be better to read, say, 10 x 100MB. Reading all this data is not ideal, I know, but it's the simplest way to avoid seek(). With thousands of small files, it will be the first step (ParseTGZ) that may become rather slow. – Alex Deckmyn Jan 09 '19 at 11:27

One can read in a csv within an archive using library(archive) as follows (this should be a lot more elegant than the currently accepted answer; the package also supports all major archive formats - 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' & 'xz' - and it works on all platforms):

library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())
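
Note that the file argument can also be given as a file name inside the archive rather than a position; the path below is just a placeholder:

read_csv(archive_read("my_tar_dir.tar.gz", file = "my_tar_dir/some_file.csv"), col_types = cols())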
Tom Wenseleers