
I have about 1000 tar.gz files (roughly 2 GB each, compressed), each containing a bunch of large .tsv (tab-separated) files, e.g. 1.tsv, 2.tsv, 3.tsv, 4.tsv, etc.

I want to work in R on a subset of the .tsv files (say 1.tsv, 2.tsv) without extracting the .tar.gz files, in order to save space/time.

I looked around but couldn't find a library or routine that streams the tar.gz files through memory and extracts data from them on the fly. Other languages have efficient ways of doing this, so I would be surprised if it couldn't be done in R.

Does anyone know of a way to accomplish this in R? Any help is greatly appreciated! Note: Unzipping/untarring the files to disk is not an option. I want to pull the relevant fields into a data.frame without extracting the archive.

Blade Runner
  • Possible duplicate of [unzip a tar.gz file in R?](https://stackoverflow.com/questions/7151145/unzip-a-tar-gz-file-in-r) – S Rivero Jul 06 '17 at 20:37
  • No. Unzipping or untarring the file is not an option. I want to read its contents without unzipping it. – Blade Runner Jul 06 '17 at 20:59
  • Look at `?untar`. You can list the files and parse them. – S Rivero Jul 06 '17 at 21:07
  • I've seen it. Even after listing you still have to extract the files to process them. – Blade Runner Jul 06 '17 at 21:12
  • The files you want (1.tsv, 2.tsv) will need to be extracted at some point if you're going to get them into memory and work with them, right? With untar you can specify `files = c("1.tsv", "2.tsv")` to extract only these. – Eric Watt Jul 06 '17 at 21:33
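
The suggestions in the comments (list the archive with `untar`, then pull only the wanted members) can be combined with a pipe so that nothing is written to disk. The following is only a minimal sketch, not a tested solution for 2 GB archives. It assumes a system `tar` on the PATH that supports `-O` (write members to stdout), and `read_tsv_member` and the demo archive are hypothetical names made up for illustration.

```r
# Build a tiny demo archive so the sketch runs end to end
# (it stands in for one of the real tar.gz files).
demo_dir <- tempfile("demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)
write.table(data.frame(id = 1:3, value = c("a", "b", "c")),
            "1.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(id = 4:5, value = c("d", "e")),
            "2.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
tar("demo.tar.gz", files = c("1.tsv", "2.tsv"), compression = "gzip")
unlink(c("1.tsv", "2.tsv"))  # only the archive remains on disk

# Hypothetical helper: stream one member of a tar.gz into a data.frame.
# Assumption: the system `tar` accepts -x (extract), -O (to stdout),
# -z (gzip) and -f (archive file).
read_tsv_member <- function(archive, member) {
  cmd <- sprintf("tar -xOzf %s %s", shQuote(archive), shQuote(member))
  # read.delim opens the pipe connection and closes it when done
  read.delim(pipe(cmd), stringsAsFactors = FALSE)
}

# List the members without extracting, then keep only the wanted ones.
members <- untar("demo.tar.gz", list = TRUE)
wanted  <- members[basename(members) %in% c("1.tsv", "2.tsv")]

# One data.frame per selected member, named by file name.
dfs <- lapply(wanted, function(m) read_tsv_member("demo.tar.gz", m))
names(dfs) <- basename(wanted)

setwd(old_wd)
```

Because a gzip-compressed tar has no index, `tar` still has to decompress the stream sequentially to reach each member, so this approach saves disk space rather than decompression time.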

0 Answers