
I have a big tar.gz file (let's say 3 GB) that expands to somewhere between 16 and 25 GB when untarred. The untarred archive has this structure:

backup
├── folder1
│   ├── somestuff.aof
│   └── dump.rdb
└── low

The only file I care about is dump.rdb, but I don't want to read the whole tar.gz, untar it in memory, and then read dump.rdb, since I have limited memory.
What's the best Spark-friendly way to read just dump.rdb? If that's not possible, what's the best way to work around the memory issue?

P.S.: I am using Amazon AWS.
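For reference, outside of Spark this is the kind of streaming extraction I mean: a minimal sketch with Python's tarfile in stream mode (the archive name and output directory are placeholders), which pulls out only dump.rdb without inflating the whole archive in memory:

```python
import tarfile

# "r|gz" opens the archive as a forward-only stream: members are
# decompressed one at a time, so the full 16-25 GB is never held in memory.
with tarfile.open("backup.tar.gz", mode="r|gz") as tar:
    for member in tar:
        if member.isfile() and member.name.endswith("dump.rdb"):
            # Writes only extracted/<member.name>, e.g. extracted/backup/folder1/dump.rdb
            tar.extract(member, path="extracted")
            break
```

In stream mode the members can only be visited in order, so the loop stops as soon as dump.rdb has been written out.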

Am1rr3zA
  • Use [pigz](http://zlib.net/pigz/) to untar the file and process it; and if you are using AWS, try [s3-dist-cp](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html) with --outputCodec=NONE, which runs a parallel MR job to untar the files into HDFS so they can be read with Spark, which can be faster. – Pavithran Ramachandran Feb 14 '18 at 19:53
  • Possible duplicate of [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – Alper t. Turker Feb 15 '18 at 01:18
  • @PavithranRamachandran --outputCodec=NONE only removes the gz compression; I still need to do something about the tar. – Am1rr3zA Feb 20 '18 at 20:30
  • From [this](https://forums.aws.amazon.com/thread.jspa?threadID=152505) it seems that it only does one level of decompression. So a workaround is to un-gzip the file and store it in S3/HDFS (depending on the space available in the cluster), use that as input, then untar it and store the result in HDFS. I will search for a better solution. – Pavithran Ramachandran Feb 20 '18 at 21:58
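As a streamed alternative to the two-step workaround above (un-gzip to S3/HDFS, then untar), the archive can also be read directly from S3 with boto3 and fed to tarfile, so nothing is materialized in between. A sketch, with bucket, key, and output path as placeholder assumptions:

```python
import boto3
import tarfile

s3 = boto3.client("s3")
# Hypothetical bucket/key; replace with wherever the backup actually lives.
body = s3.get_object(Bucket="my-backup-bucket", Key="backups/backup.tar.gz")["Body"]

# The StreamingBody is a non-seekable file-like object, which is exactly
# what tarfile's "r|gz" stream mode expects: decompress and scan sequentially.
with tarfile.open(fileobj=body, mode="r|gz") as tar:
    for member in tar:
        if member.isfile() and member.name.endswith("dump.rdb"):
            tar.extract(member, path="/mnt/extracted")
            break
```

The extracted dump.rdb can then be uploaded back to S3 or put into HDFS for whatever downstream processing needs it.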

0 Answers