
I am trying to read the content of a .gz file in Spark/Scala into a DataFrame/RDD using the following code:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // wholeTextFiles returns (path, fileContent) pairs, one per file
    val data = sc.wholeTextFiles("path to gz file")
    data.collect().foreach(println)

The .gz file is 28 MB, and I do the spark-submit using this command:

spark-submit --class sample --master local[*] target\spark.jar

It gives me a Java heap space error in the console.

Is this the best way of reading a .gz file, and if so, how can I solve the Java heap space error?


Thanks

baiduXiu
  • Your solution is in [reading multiple compressed files](https://stackoverflow.com/questions/38635905/reading-in-multiple-files-compressed-in-tar-gz-archive-into-spark) – Ramesh Maharjan Jun 17 '17 at 10:19
  • The original answer is actually here: [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – eliasah Jun 17 '17 at 10:23
  • 2
    Possible duplicate of [Read whole text files from a compression in Spark](https://stackoverflow.com/questions/36604145/read-whole-text-files-from-a-compression-in-spark) – mrsrinivas Jun 17 '17 at 11:27

1 Answer


Disclaimer: that code and description simply read a small compressed text file with Spark, collect every line into an array on the driver, and print each line to the console. The ways (and reasons) to do this outside Spark far outnumber those to do it in Spark.

1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the call to use (it automatically handles several compression formats, including gzip).

2) Or at least use sc.textFile() instead of wholeTextFiles().

3) You're calling .collect() on the data, which brings the entire file back to the driver (since you're running local, you're not network bound, but it still has to fit in driver memory). If you MUST collect, add the --driver-memory option to spark-submit to increase the driver heap. A sketch of the SparkSession approach is below.
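For reference, a minimal sketch of the SparkSession approach; the path and app name are placeholders, not from the original question:

    import org.apache.spark.sql.SparkSession

    // Build (or reuse) a SparkSession; spark-shell and spark-submit with
    // recent Spark versions usually provide one already as `spark`.
    val spark = SparkSession.builder()
      .appName("ReadGzFile")          // placeholder app name
      .getOrCreate()

    // read.text() decompresses .gz input transparently and returns a
    // DataFrame with one string column named "value", one row per line.
    val df = spark.read.text("path/to/file.gz")  // placeholder path

    // Work with the data on the executors instead of collecting it all:
    println(df.count())
    df.show(20, truncate = false)

The sc.textFile("path/to/file.gz") variant likewise decompresses gzip automatically and gives you an RDD[String]. If you really must collect() everything to the driver, raise the driver heap at submit time, e.g. spark-submit --driver-memory 4g ... (4g is just an example value; size it to your data).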

Garren S