2

I have a large tar.gz file (>2GB) from which I want to read a specific .dat file in R without unzipping the original tar.gz file.

I tried to follow this post as follows:

p35_data_path <- "~/P35_fullset.tar.gz" 
file.exists(p35_data_path) #TRUE

# Try to read in foldera/class1/mydata.dat from the zip file
mydata <- read.table(unz(p35_data_path
                       , "foldera/class1/mydata.dat"))

When I run the above, read.table fails with the following error:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open zip file '~/P35_fullset.tar.gz'

The "~/P35_fullset.tar.gz" file exists, and the specific file foldera/class1/mydata.dat definitely exists within it.

Could anyone please assist in rectifying this?

user4687531
  • 2
    You mean "not unzipping the original `tar.gz` to ssd/spindle-based storage" since it does have to be decompressed in memory. NOTE also that `unz()` is only for `.zip` files. Not sure where you got the impression it handled `.tar.gz` from. Check out the [`archive`](https://github.com/jimhester/archive) package. – hrbrmstr Jan 12 '18 at 04:06
  • @hrbrmstr. Thanks - I basically didn't want to untar all of the contents to disk. I could untar specific files to disk but this took too long for a single file, so thought there may be a way to do it by just accessing them individually. I'll check out the `archive` package and report back with queries – user4687531 Jan 12 '18 at 04:31
  • 1
    This worked for me (thanks @hrbrmstr):
    ```
    p35_data_path <- "~/P35_fullset.tar.gz"
    # Try to read in foldera/class1/mydata.dat from the archive
    x <- archive::archive_read(archive = p35_data_path, file = "foldera/class1/mydata.dat")
    mydata <- readr::read_csv(x)
    ```
    – user4687531 Jan 12 '18 at 04:49
  • Added an option using library(archive) which worked well for me... – Tom Wenseleers Jul 11 '22 at 16:25
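
As the first comment notes, `unz()` only understands `.zip` containers, which is why it rejects a `.tar.gz`. One way to confirm what kind of file you actually have is to look at its leading "magic" bytes: gzip streams start with `1f 8b`, zip files with `50 4b` ("PK"). The sketch below is only illustrative; `sniff_format` is a hypothetical helper name, and the demo file is a small stand-in created in a temporary location rather than the archive from the question.

```r
# Hypothetical helper: guess the container type from the first two bytes.
sniff_format <- function(path) {
  # raw = TRUE so that file() does not try to decompress anything
  con <- file(path, "rb", raw = TRUE)
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2L)
  if (identical(magic, as.raw(c(0x1f, 0x8b)))) {
    "gzip"
  } else if (identical(magic, as.raw(c(0x50, 0x4b)))) {
    "zip"
  } else {
    "unknown"
  }
}

# Stand-in gzip file so the sketch is runnable on its own
demo <- tempfile(fileext = ".gz")
con <- gzfile(demo, "w")
writeLines("1 2 3", con)
close(con)

fmt <- sniff_format(demo)
print(fmt)
```

If this reports "gzip" for your file, `unz()` is the wrong tool and one of the tar-aware approaches in the answers is needed.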

3 Answers

0
library(data.table)

# Tarfile is the path to your .tar.gz archive
da <- untar(Tarfile, files = NULL, list = TRUE, exdir = ".", compressed = "gzip") # list the files in the tar archive without extracting them

da <- as.data.table(da) # save the listed files as a data.table

Then use your own filter technique to filter the files, as I did, saving the result in Name:

g <- c(da$Name)  # the names of the files to extract

untar(Tarfile, files = g, list = FALSE, exdir = "exportRQA", compressed = "gzip") # finally, extract only the selected files
user438383
0

Using library(archive), one can read a particular csv file from within an archive without having to extract it to disk first:

library(archive)
library(readr)
read_csv(archive_read("~/P35_fullset.tar.gz", file = 1), col_types = cols())

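(Adjust the file argument as appropriate: it can be the file's position in the archive, as here, or its path within the archive, such as "foldera/class1/mydata.dat".)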
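
To work out which value to pass as `file`, it can help to list the archive's contents first. The following is a minimal sketch, assuming the `archive` and `readr` packages are installed and that `archive()` returns a table with a `path` column, as its documentation describes. The small archive and the `mydata.csv` file are hypothetical stand-ins built in a temporary directory so that the example runs on its own.

```r
library(archive)
library(readr)

# Build a tiny stand-in .tar.gz (hypothetical contents) with base R
work <- tempfile("archive_demo_")
dir.create(file.path(work, "foldera", "class1"), recursive = TRUE)
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          file.path(work, "foldera", "class1", "mydata.csv"),
          row.names = FALSE)
old <- setwd(work)
tar("demo.tar.gz", files = "foldera", compression = "gzip")
setwd(old)
archive_path <- file.path(work, "demo.tar.gz")

# List the members of the archive without extracting anything
contents <- archive(archive_path)
print(contents$path)

# Pick the member by name rather than by position
target <- grep("mydata\\.csv$", contents$path, value = TRUE)
mydata <- read_csv(archive_read(archive_path, file = target),
                   col_types = cols())
```

Selecting by path is usually more robust than `file = 1`, since the position of a member depends on how the archive was created.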

Tom Wenseleers
-1

You should be able to unpack the archive with base R's untar():

p35_data_path <- "~/P35_fullset.tar.gz" 
file.exists(p35_data_path) #TRUE

# Read in foldera/class1/mydata.dat from the .tar.gz file
untar(p35_data_path, "foldera/class1/mydata.dat")  # this extracts the file from archive
mydata <- read.table("foldera/class1/mydata.dat")  # so you can read it

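By default the file is extracted into the current working directory, keeping its path from the archive (here foldera/class1/). You can choose a different destination with the exdir argument; see ?untar for more information.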
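
As a rough illustration of the `exdir` option, the sketch below extracts a single member into a scratch directory, reads it, and then deletes the extracted copy so nothing is left behind. It uses only base R; the demo archive and its contents are hypothetical stand-ins created in a temporary directory, not the file from the question.

```r
# Build a tiny stand-in .tar.gz (hypothetical contents) so the sketch is runnable
work <- tempfile("tar_demo_")
dir.create(file.path(work, "foldera", "class1"), recursive = TRUE)
write.table(data.frame(x = 1:3, y = c(2.5, 3.5, 4.5)),
            file.path(work, "foldera", "class1", "mydata.dat"),
            row.names = FALSE)
old <- setwd(work)
tar("demo.tar.gz", files = "foldera", compression = "gzip")
setwd(old)
archive_path <- file.path(work, "demo.tar.gz")

# Extract only the one member we need, into a scratch directory
member <- "foldera/class1/mydata.dat"
out_dir <- tempfile("extract_")
dir.create(out_dir)
untar(archive_path, files = member, exdir = out_dir)

# Read it, then remove the extracted copy
mydata <- read.table(file.path(out_dir, member), header = TRUE)
unlink(out_dir, recursive = TRUE)
```

Note that the archive still has to be decompressed and scanned to find the member, so on a very large `.tar.gz` this step may remain slow even though only one file is written to disk.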