
I have large compressed (.zip) files, around 10 GB each. I need to read the contents of the files inside each zip without unzipping it first, and then apply transformations.

   import org.apache.spark.{SparkConf, SparkContext}

   // `user` and `filePath` are defined elsewhere
   System.setProperty("HADOOP_USER_NAME", user)

   println("Creating SparkConf")
   val conf = new SparkConf().setAppName("DFS Read Write Test")

   println("Creating SparkContext")
   val sc = new SparkContext(conf)

   val textFile = sc.textFile(filePath)

   println("Count...." + textFile.count())

   val df = textFile.map(some code)

When I pass any .txt, .log, .md, etc. file, the code above works fine. But when I pass .zip files, it gives a count of zero.

  1. Why is the count zero?
  2. Please suggest the correct way of doing this, if I am totally wrong.
Siva Kumar

1 Answer


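You have to do it like this; it is a different operation than simply loading the other kinds of files that Spark supports. Note that ZipFileInputFormat is not part of Spark or Hadoop, so it has to be on the classpath, either from a third-party library or from your own code.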

val rdd = sc.newAPIHadoopFile("file.zip", classOf[ZipFileInputFormat], classOf[Text], classOf[Text], new Job().getConfiguration())
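
If you would rather not depend on a custom input format, another option is to read each archive as a binary stream and unpack it with java.util.zip. As far as I know, Hadoop's built-in compression codecs (gzip, bzip2 and so on) do not cover zip archives, which is why sc.textFile does not decode them. The following is only a minimal sketch: the object name ZipLines and the helper linesFromZip are made up for illustration, and it assumes every entry in the archive is UTF-8 text.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

// Hypothetical helper, not part of Spark or Hadoop.
object ZipLines {

  // Lazily yields every text line of every file entry in a zip stream.
  // Assumes the entries are UTF-8 text; directory entries are skipped.
  def linesFromZip(in: InputStream): Iterator[String] = {
    val zis = new ZipInputStream(in)
    Iterator
      .continually(zis.getNextEntry)   // advance to the next entry
      .takeWhile(_ != null)            // null means no more entries
      .filterNot(_.isDirectory)
      .flatMap { _ =>
        // ZipInputStream reports end-of-stream at the end of the current
        // entry, so a fresh reader per entry stops at the entry boundary.
        val reader = new BufferedReader(
          new InputStreamReader(zis, StandardCharsets.UTF_8))
        Iterator.continually(reader.readLine()).takeWhile(_ != null)
      }
  }
}
```

With the question's `sc` and `filePath`, this could be used as `sc.binaryFiles(filePath).flatMap { case (_, stream) => ZipLines.linesFromZip(stream.open()) }`, which gives an RDD[String] you can call count() or map on. Keep in mind that a zip archive is not splittable, so each 10 GB file is unpacked by a single task; parallelism comes only from having several archives.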
Kshitij Kulshrestha