
I have an HDFS directory with 13.2 GB of data in 4 files. I am trying to read all the files using the wholeTextFiles method in Spark, but I have some issues.

This is my code.

val path = "/tmp/cnt/warehouse/"
val whole = sc.wholeTextFiles(path, 32)                // RDD of (fileName, whole file content)
val data = whole.map(r => (r._1, r._2.split("\r\n")))  // split each file's content into lines
val x = data.flatMap(r => r._1)                        // flattens each file name into characters
x.take(1000).foreach(println)

Below is my spark-submit.

spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
  1. Even though I give minPartitions as 32, it is stored in only 4 partitions.
  2. Is my spark-submit correct or not?

The error is below.

Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)



Container exited with a non-zero exit code 52
.
Driver stacktrace:
  • Possible duplicate of [Spark textFile vs wholeTextFiles](https://stackoverflow.com/questions/47129950/spark-textfile-vs-wholetextfiles). This should explain why `wholeTextFiles` can fail when the files are large and why you only get 4 partitions. Try using `textFile` if possible (sketched below). – Shaido Jun 18 '19 at 06:50
  • Can you provide us the error that you are getting while reading the files from HDFS? – Nikhil Suthar Jun 18 '19 at 07:43
  • The error message has been added to the question @Nikk – Ganesh Dogiparthi Jun 19 '19 at 10:02
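
As a quick illustration of the `textFile` suggestion above, here is a minimal sketch, assuming the records are line-delimited and the file names are not needed; unlike `wholeTextFiles`, `textFile` splits large files across partitions:

val lines = sc.textFile("/tmp/cnt/warehouse/", 32)  // large files are split across many partitions
println(lines.getNumPartitions)                     // typically at least 32 for ~13 GB of input
lines.take(1000).foreach(println)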

1 Answer

  1. Even though I give minPartitions as 32, it is stored in only 4 partitions.

You can refer to the link below:

Spark Creates Less Partitions Then minPartitions Argument on WholeTextFiles
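
In short (a sketch, assuming a `SparkContext` named `sc`): `wholeTextFiles` produces one (fileName, content) record per file and never splits a file, so with 4 files you get at most 4 partitions no matter what minPartitions hint you pass, and each file's entire content must fit in memory as a single String.

val whole = sc.wholeTextFiles("/tmp/cnt/warehouse/", 32)  // minPartitions is only a hint here
println(whole.getNumPartitions)  // at most 4: one record per file, files are never split
// Each record holds a whole file as one String, so a multi-GB file has to fit in a
// single task's memory; that pressure is a common cause of executor exit code 52 (OOM).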

  2. Is my spark-submit correct or not?

The syntax is correct, but the values you have passed are far more than needed. You are giving 32 * 15 = 480 GB to the executors plus 25 GB to the driver just to process 13 GB of data. Giving more executors and more memory does not give a more efficient result; sometimes it causes overhead and even failures due to lack of resources.
The error also points to an issue with the resources you are using. For processing only 13 GB of data you should use something like the configuration below (not exactly this; you have to calculate it):

  • Executors: 6
  • Cores per executor: 5
  • Executor memory: 5 GB
  • Driver memory: 2 GB
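
For illustration only, a submit command with roughly that shape could look like the sketch below (the exact numbers depend on your data and cluster, as noted above):

spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 6 \
--executor-cores 5 \
--executor-memory 5G \
--driver-memory 2G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar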

For more details and the calculation you can refer to the link below:

How to tune spark executor number, cores and executor memory?

Note: The driver does not require more memory than the executors, so in most cases driver memory should be less than or equal to executor memory.

Nikhil Suthar