
How can I read a tar.bzip2 file in Spark in parallel? I have created a custom Java Hadoop reader that reads the tar.bzip2 file, but it takes too much time because only one core is used, and after a while the application fails because a single executor receives all the data.

Dhimant Jayswal
  • Why do you need a custom reader? BZIP2 is already a splittable format, so blocks should already be read in parallel – OneCricketeer Jan 18 '17 at 17:35
  • I have tar.bzip2 files, and every tar.bzip2 file contains a couple of JSON files. – Dhimant Jayswal Jan 18 '17 at 18:15
  • Exactly... and that file type is splittable – OneCricketeer Jan 18 '17 at 18:15
  • Just to add, I have multiline JSON files in a single folder, and I am compressing that folder to tar.bzip2. If I do sc.textFile("/tmp/xyz.tar.bz2") I get binary characters in the RDD[String] and can't parse the JSON records. – Dhimant Jayswal Jan 18 '17 at 20:21
  • Sure, because the JSON is compressed in an archive file. – OneCricketeer Jan 18 '17 at 20:23
  • Actually, there are multiple multiline JSON files inside the tar.bzip2 file, and when I try to read it using textFile(), I get binary characters and multiline JSON that I can't process. – Dhimant Jayswal Jan 19 '17 at 20:37
  • Right, as I've hinted, you are reading a compressed file, not a text file. ["sc.textfile... cannot handle compressed data"](http://stackoverflow.com/a/38636089/2308683) – OneCricketeer Jan 19 '17 at 21:46
  • I got your point @cricket_007. But what is the solution to read a tar.bz2 file in parallel via multiple executors? – Dhimant Jayswal Jan 24 '17 at 19:02
  • 1) Can you show your code? 2) After reading the question again, why do you need a custom reader? Bzip2 is already a supported format – OneCricketeer Jan 24 '17 at 20:35
  • @cricket_007 it is not bzip2, it is tar.bzip2. – Dhimant Jayswal Jan 30 '17 at 15:22
  • I understand that, which I have already answered. `sc.textFile` cannot read compressed data. The FileStream of the bzip2 file is being read correctly. Try applying `bz2` to a single JSON file, and you'd probably get the data back as expected. – OneCricketeer Jan 30 '17 at 15:36
  • The binary character strings you see in your RDD are part of the TAR archive file that has been compressed with bzip2. As the question I already linked to shows, you can try [Read whole text files from compression](http://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) (see the sketch after these comments) or you can [create a sequence file instead](https://stuartsierra.com/2008/04/24/a-million-little-files) (which Spark can read) – OneCricketeer Jan 30 '17 at 15:36
  • I already have the data and I want to read it. I can't change the input data format. – Dhimant Jayswal Jan 30 '17 at 18:28
  • Trying to follow this thread. To be clear, I have a tar.bzip2 file as well. I understand bzip2 files are splittable, so when I read data into an RDD from the bzip2 file, each partition will contain binary tar data. @OneCricketeer are you suggesting to create a sequence file for each partition? – vi_ral Jun 18 '20 at 00:26
  • @vi_ral I very rarely use SequenceFiles at all, but Spark should be able to read and handle bz2 just fine - https://stackoverflow.com/a/52981804/2308683 I'm not sure if TAR matters or not, because I also don't use those... Most files I deal with regularly are Avro/Parquet compressed with Snappy – OneCricketeer Jun 18 '20 at 03:18
  • Yeah, good point: tar does not get distributed, so it's not good to use with Spark. As a side note, as far as I know this question is asking about a tar file that's bzipped. – vi_ral Jun 18 '20 at 03:31
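
For reference, the approach from the first link in the comments (reading each whole archive with `sc.binaryFiles` and unpacking it with Apache Commons Compress) might look roughly like the following Scala sketch. The wildcard path and the UTF-8 assumption are illustrative, not taken from the thread:

```scala
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import scala.io.Source

// Each record of binaryFiles is (path, PortableDataStream) for one whole archive.
val jsonDocs = sc.binaryFiles("hdfs:///data/*.tar.bz2")
  .flatMap { case (path, stream) =>
    // bzip2 decompression wrapped around a tar reader over the raw file bytes
    val tar = new TarArchiveInputStream(
      new BZip2CompressorInputStream(stream.open()))

    // Walk the tar entries; each regular entry here is one multiline JSON file.
    Iterator.continually(tar.getNextTarEntry)
      .takeWhile(_ != null)
      .filter(_.isFile)
      .map(_ => Source.fromInputStream(tar, "UTF-8").mkString) // read just this entry
      .toList                                                  // materialise before the stream goes away
  }

// jsonDocs is an RDD[String] with one element per JSON file inside the archives.
```

Note that `binaryFiles` hands each whole archive to a single task, so this parallelizes across archives rather than within a single archive.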

1 Answer


As we know, bzipped files are splittable, so when a bzipped file is read into an RDD the data gets distributed across partitions. However, the underlying tar file is also spread across those partitions, and tar itself is not splittable, so if you perform an operation on any one partition you will just see a lot of binary data.

To solve this, I simply read the bzipped data into an RDD with a single partition. I then wrote this RDD out to a directory, so there was a single file containing all of the tar data. Finally, I pulled this tar file from HDFS down to my local file system and untarred it.
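
A minimal sketch of those steps (Scala; the HDFS paths are placeholders, and since `textFile`/`saveAsTextFile` treat the stream as lines of text, this assumes the tar payload survives that round trip):

```scala
// Read the bz2-compressed archive, then force everything into one partition
// so the tar stream stays contiguous when it is written back out.
val archive = sc.textFile("hdfs:///data/xyz.tar.bz2").coalesce(1)

// Writes a directory containing a single part file that holds the tar data.
archive.saveAsTextFile("hdfs:///tmp/xyz-tar")
```

From there the single part file can be copied to the local file system (e.g. with `hdfs dfs -get`) and untarred with `tar -xf`.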

vi_ral