
I have a set of large XML files zipped together into a single file, and many such zip files. Earlier I was using MapReduce to parse the XML with a custom InputFormat and RecordReader, overriding isSplitable to return false, and reading the zip and XML files myself.

I am new to Spark. Can someone explain how I can prevent Spark from splitting each zip file, while still processing multiple zips in parallel, as I was able to do in MR?

Pooja3101
  • Can you please provide an example or a use case? Thanks! I am not able to understand your question. – Shivansh Jul 18 '16 at 10:29
  • I have a few large XMLs, zipped across multiple zips. I just want to parse each zip and its XML without it being split based on block size. – Pooja3101 Jul 18 '16 at 13:28

1 Answer


AFAIK, the answer to your question is provided here by @holden. Please take a look! Thanks :)
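In case the linked answer is unclear, a minimal sketch of that approach follows. The path glob and the read-everything-into-a-string handling are illustrative assumptions, not part of the original answer; sc.binaryFiles hands each zip to a single task as one PortableDataStream, so Spark never splits the archive, and parallelism comes from having many zip files.

    import java.util.zip.ZipInputStream
    import org.apache.spark.input.PortableDataStream

    // Each zip file becomes exactly one (path, stream) record, so the
    // archive is never split across tasks; multiple zips are still
    // processed in parallel, one per task.
    val xmlContents = sc.binaryFiles("hdfs:///path/to/zips/*.zip").flatMap {
      case (name: String, content: PortableDataStream) =>
        val zis = new ZipInputStream(content.open())
        try {
          Iterator.continually(zis.getNextEntry)
            .takeWhile(_ != null)
            .map { _ =>
              // ZipInputStream.read signals end-of-entry with -1, so this
              // reads exactly one XML entry into a string.
              scala.io.Source.fromInputStream(zis).mkString
            }
            .toList // materialise before the stream is closed
        } finally zis.close()
    }

This assumes each XML entry fits in memory on one executor; the resulting RDD[String] can then be fed to whatever XML parser you were using in MR.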

Shivansh
    That's very useful, but not an answer as it is. You could either mark the question as a duplicate and suggest closing it in favor of the one you link to, or you could link to the other answer, but also edit your answer to quote the important bits from the one you link to. – Daniel Darabos Jul 18 '16 at 11:11
  • I have gone through the link you shared. I just have one doubt: how can I parse a single file without it being split? In MR I override isSplitable to return false in my custom InputFormat class. How can I achieve the same in Spark? – Pooja3101 Jul 18 '16 at 13:30
  • I tried as below but am getting an error:

    val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name: String, content: PortableDataStream) => new ZipInputStream(content.open) }

    :95: error: type mismatch;
     found   : java.util.zip.ZipInputStream
     required: TraversableOnce[?]
           val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name, content) => new ZipInputStream(content.open) }

    – Pooja3101 Jul 21 '16 at 15:26
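The type mismatch in that last comment arises because RDD.flatMap expects the function to return a TraversableOnce, and a ZipInputStream is not one. A sketch of one way to fix it (the string-per-entry handling is an illustrative assumption) is to walk the zip entries with an Iterator and materialise them into a collection before the stream is closed:

    import java.util.zip.ZipInputStream

    val zipFileRDD = sc.binaryFiles(zipFile).flatMap {
      case (name, content) =>
        val zis = new ZipInputStream(content.open())
        try {
          // Turn the stream of zip entries into a concrete List[String],
          // which satisfies flatMap's TraversableOnce requirement.
          Iterator.continually(zis.getNextEntry)
            .takeWhile(_ != null)
            .map(_ => scala.io.Source.fromInputStream(zis).mkString)
            .toList
        } finally zis.close()
    }

The .toList is important: if the iterator were returned lazily, the entries would be read only after zis.close() has run.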