How can I read a tar.bzip2 file in Spark in parallel? I have written a custom Hadoop reader in Java that reads the tar.bzip2 file, but it takes too long because only one core is used, and after some time the application fails because a single executor receives all the data.
1 Answer
As we know, bzip2 files are splittable, so when you read a bzipped file into an RDD the data is distributed across the partitions. However, the tar archive inside it is distributed across the partitions as well, and tar is not a splittable format: each partition ends up holding an arbitrary slice of the tar stream, with headers and file contents mixed together. If you try to operate on a single partition, you will just see a lot of binary data.
To solve this, I read the bzipped data into an RDD with a single partition and wrote that RDD out to a directory, which gives a single file containing all of the tar data. I then pulled this tar file from HDFS down to my local file system and untarred it.
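As a rough illustration of the final untar step, here is a minimal Python sketch that uses only the standard library. It assumes that each archive member holds newline-delimited JSON, which is a guess based on the records mentioned in the comment below. The names `iter_json_records` and `build_sample_archive`, and the sample members `a.json` and `b.json`, are hypothetical. The key point is that a tar-aware reader consumes the tar headers itself, so only the file contents reach the JSON parser.

```python
import io
import json
import tarfile


def iter_json_records(archive_bytes):
    """Yield (member_name, record) for every JSON line in a .tar.bz2 archive."""
    # Mode "r:bz2" makes tarfile decompress bzip2 and parse the tar headers,
    # so only the payload of each member is handed to the JSON parser.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:bz2") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            with tar.extractfile(member) as fh:
                for line in fh:
                    line = line.strip()
                    if line:
                        yield member.name, json.loads(line)


def build_sample_archive():
    """Build a tiny in-memory .tar.bz2 (hypothetical data) so the sketch runs."""
    samples = [("a.json", [{"id": 1}, {"id": 2}]), ("b.json", [{"id": 3}])]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, rows in samples:
            payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    for name, record in iter_json_records(build_sample_archive()):
        print(name, record)
```

For a real archive on the local file system, `tarfile.open(path, mode="r:bz2")` can be called with a path instead of `fileobj`. This still reads one archive sequentially, so it does not by itself make the read parallel.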

vi_ral
sc.textFile("/tmp/xyz.tar.bz2")
I get binary characters in the RDD[String] and can't parse the JSON records. – Dhimant Jayswal Jan 18 '17 at 20:21