
I have a bunch of tar.gz files which I would like to process with Spark without decompressing them.

A single archive is ~700MB and contains 10 different files, but I'm interested in only one of them (which is ~7GB after decompression).

I know that context.textFile supports tar.gz, but I'm not sure it's the right tool when an archive contains more than one file. What happens is that Spark returns the content of all files in the archive (line by line), including the file names and some binary data.

Is there any way to select which file from tar.gz I would like to map?

Lukasz Kujawa

1 Answer


AFAIK, I'd suggest the sc.binaryFiles method; please see the doc below. It gives you each file name together with the file content as a PortableDataStream, so you can map over the archives, pick out the file you want, and process it (a sketch follows the doc excerpt).


public RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path,
                                                                int minPartitions)

Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data). For example, if you have the following files:

hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn

Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),

then rdd contains

(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)

Also, check this

Ram Ghadiyaram