
How can I read a tar.bzip2 file in Spark in parallel? I have created a custom Java Hadoop reader that reads the tar.bzip2 file, but it takes too much time because only one core is used, and after a while the application fails because a single executor receives all the data.

Dhimant Jayswal
  • Why do you need a custom reader? BZIP2 is already a splittable format, so blocks should already be read in parallel – OneCricketeer Jan 18 '17 at 17:35
  • I have tar.bzip2 files, and every tar.bzip2 file contains a couple of JSON files. – Dhimant Jayswal Jan 18 '17 at 18:15
  • Exactly... and that file type is splittable – OneCricketeer Jan 18 '17 at 18:15
  • Just to add, I have multiline JSON files in a single folder, and I am compressing that folder to tar.bzip2. If I do sc.textFile("/tmp/xyz.tar.bz2") I get binary characters in the RDD[String] and can't parse the JSON records. – Dhimant Jayswal Jan 18 '17 at 20:21
  • Sure, because the JSON is compressed in an archive file. – OneCricketeer Jan 18 '17 at 20:23
  • Actually, there are multiple multiline JSON files inside the tar.bzip2 file, and when I try to read it using textFile(), I get binary characters and multiline JSON that I can't process. – Dhimant Jayswal Jan 19 '17 at 20:37
  • Right, as I've hinted, you are reading a compressed file, not a text file. ["sc.textfile... cannot handle compressed data"](http://stackoverflow.com/a/38636089/2308683) – OneCricketeer Jan 19 '17 at 21:46
  • I got your point @cricket_007. But what is the solution to read a tar.bz2 file in parallel via multiple executors? – Dhimant Jayswal Jan 24 '17 at 19:02
  • 1) Can you show your code? 2) After reading the question again, why do you need a custom reader? Bzip2 is already a supported format – OneCricketeer Jan 24 '17 at 20:35
  • @cricket_007 it is not bzip2, it is tar.bzip2. – Dhimant Jayswal Jan 30 '17 at 15:22
  • I understand that, which I have already answered. `sc.textFile` cannot read compressed data. The FileStream of the bzip2 file is being read correctly. Try applying `bz2` to a single JSON file, and you'd probably get the data back as expected. – OneCricketeer Jan 30 '17 at 15:36
  • The binary character strings you see in your RDD are part of the TAR archive file that has been compressed with bzip2. As the question I already linked to shows, you can try [Read whole text files from compression](http://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) (see the sketch after these comments) or you can [create a sequence file instead](https://stuartsierra.com/2008/04/24/a-million-little-files) (which Spark can read) – OneCricketeer Jan 30 '17 at 15:36
  • I already have the data and I want to read it. I can't change the input data format. – Dhimant Jayswal Jan 30 '17 at 18:28
  • Trying to follow this thread. To be clear, I have a tar.bzip2 file as well. I understand bzip2 files are splittable, so when I read data into an RDD from the bzip2 file, each partition will contain binary tar data. @OneCricketeer are you suggesting to create a sequence file for each partition? – vi_ral Jun 18 '20 at 00:26
  • @vi_ral I very rarely use SequenceFiles at all, but Spark should be able to read and handle bz2 just fine - https://stackoverflow.com/a/52981804/2308683 I'm not sure if TAR matters or not, because I also don't use those... Most files I deal with regularly are Avro/Parquet compressed with Snappy – OneCricketeer Jun 18 '20 at 03:18
  • Yeah, good point: tar does not get distributed, so it's not good to use with Spark. As a side note, as far as I know this question is asking about a tar file that's bzipped. – vi_ral Jun 18 '20 at 03:31
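
For reference, the approach from the first link in the comments (reading each whole archive with `sc.binaryFiles` and unpacking it with Apache Commons Compress) might look roughly like the following Scala sketch. The wildcard path and the UTF-8 assumption are illustrative, not taken from the thread:

```scala
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import scala.io.Source

// Each record of binaryFiles is (path, PortableDataStream) for one whole archive.
val jsonDocs = sc.binaryFiles("hdfs:///data/*.tar.bz2")
  .flatMap { case (path, stream) =>
    // bzip2 decompression wrapped around a tar reader over the raw file bytes
    val tar = new TarArchiveInputStream(
      new BZip2CompressorInputStream(stream.open()))

    // Walk the tar entries; each regular entry here is one multiline JSON file.
    Iterator.continually(tar.getNextTarEntry)
      .takeWhile(_ != null)
      .filter(_.isFile)
      .map(_ => Source.fromInputStream(tar, "UTF-8").mkString) // read just this entry
      .toList                                                  // materialise before the stream goes away
  }

// jsonDocs is an RDD[String] with one element per JSON file inside the archives.
```

Note that `binaryFiles` hands each whole archive to a single task, so this parallelizes across archives rather than within a single archive.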

1 Answer


As we know, bzipped files are splittable, so when a bzipped file is read into an RDD the data gets distributed across partitions. However, the underlying tar file is also spread across those partitions, and tar itself is not splittable, so if you perform an operation on any one partition you will just see a lot of binary data.

To solve this, I simply read the bzipped data into an RDD with a single partition. I then wrote this RDD out to a directory, so there was a single file containing all of the tar data. Finally, I pulled this tar file from HDFS down to my local file system and untarred it.
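
A minimal sketch of those steps (Scala; the HDFS paths are placeholders, and since `textFile`/`saveAsTextFile` treat the stream as lines of text, this assumes the tar payload survives that round trip):

```scala
// Read the bz2-compressed archive, then force everything into one partition
// so the tar stream stays contiguous when it is written back out.
val archive = sc.textFile("hdfs:///data/xyz.tar.bz2").coalesce(1)

// Writes a directory containing a single part file that holds the tar data.
archive.saveAsTextFile("hdfs:///tmp/xyz-tar")
```

From there the single part file can be copied to the local file system (e.g. with `hdfs dfs -get`) and untarred with `tar -xf`.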

vi_ral