I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files:
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a dataframe from the JSON files, but Spark is not reading them in correctly. Reading the archive with sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
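For reference, this is roughly what I tried (PySpark, with `sc` and `sqlContext` already set up):

```python
# Attempt 1: point the JSON reader straight at the archive
df = sqlContext.read.json("archive.tar.gz")

# Attempt 2: read the archive as text
rdd = sc.textFile("archive.tar.gz")
```

As far as I can tell, Spark transparently gunzips the file but then treats the remaining tar stream as plain text, so tar headers and file names end up mixed in with the JSON records.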
Is there some way to handle gzipped tar archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering whether there is an efficient way to deal with the problem.
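The approach from that answer looks roughly like this in PySpark (a sketch under my assumptions; the helper name `extract_json_lines` is mine, and I'm assuming the archive members are plain .json files):

```python
import io
import tarfile

def extract_json_lines(archive_bytes):
    # Open the tar.gz entirely in memory and yield each line of every .json member.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                f = tar.extractfile(member)
                if f is not None:
                    for line in f.read().decode("utf-8").splitlines():
                        yield line

# binaryFiles yields (path, content) pairs, one record per whole archive
lines = sc.binaryFiles("archive.tar.gz").flatMap(lambda kv: extract_json_lines(kv[1]))
df = sqlContext.read.json(lines)
```

I suspect it chokes on big archives because binaryFiles loads each archive as a single record, so the entire compressed file has to fit in one task's memory before anything can be parallelized.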
I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.