Read and process a *.tar.gz file with PySpark

Asked Sep 19 '18 at 11:36

Active Sep 19 '18 at 11:36

Viewed 1,565 times

Let us assume I have a tar.gz archive containing 7 CSV files. How can I process such an archive so that each CSV file ends up in a separate RDD or DataFrame?

I have tried the approach mentioned here, but I get all 7 CSV files in one RDD, which is the same as simply calling sc.textFile().

I am using Spark 2.*

asked Sep 19 '18 at 11:36

sdikby


Actually, the code provided in the link doesn't do the same thing as sc.textFile(). It returns an RDD in which each element is the whole content of one file inside the archive, so you can easily filter this RDD to find the content of the files you need. What is the purpose of having many RDDs? – maxteneff Sep 21 '18 at 13:35
I want to persist each file as an Avro file on HDFS, and each one should be accessible from a Hive table with the same schema – sdikby Sep 21 '18 at 14:30
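
For illustration, the approach described in the comments (one RDD element per file inside the archive, then filtering by file) might be sketched as follows. This is only a sketch under stated assumptions, not the code from the linked post: the helper names `extract_members` and `rdds_by_member` are hypothetical, the CSV files are assumed to be UTF-8 encoded, and the whole archive is assumed to fit in the memory of a single executor, since `sc.binaryFiles()` hands each archive to one task as a single byte string.

```python
import io
import tarfile


def extract_members(path_and_bytes):
    """Yield (member_name, text) for every regular file in one tar.gz archive.

    Takes a (path, bytes) pair, which is the element shape produced by
    SparkContext.binaryFiles().
    """
    _path, raw = path_and_bytes
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                # Assumption: the files inside the archive are UTF-8 text.
                yield member.name, archive.extractfile(member).read().decode("utf-8")


def rdds_by_member(sc, archive_path):
    """Return {member_name: RDD of text lines}, one RDD per file in the archive.

    `sc` is an existing SparkContext; `archive_path` points at the tar.gz file.
    """
    files = sc.binaryFiles(archive_path).flatMap(extract_members).cache()
    names = files.keys().distinct().collect()
    return {
        # Bind `name` as a default argument so each lambda keeps its own value.
        name: files.filter(lambda kv, n=name: kv[0] == n)
                   .flatMap(lambda kv: kv[1].splitlines())
        for name in names
    }
```

Each value in the returned dictionary is an RDD of lines for one CSV file. As far as I know, in Spark 2.2 and later such an RDD of strings can be passed to `spark.read.csv(rdd, header=True)` to get a DataFrame, which can then be written out in whatever format is needed.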

0 Answers