
I am running a small PySpark script that extracts some data from HBase tables and builds a PySpark DataFrame. When I try to save the DataFrame back onto local HDFS, I run into an exit 50 error.

I am able to do the same operation successfully for comparatively smaller DataFrames, but it fails for large ones. I can gladly share any code snippets, and the entire environment from the Spark UI can be shared as a screenshot. Any help would be appreciated.

These are my Spark (2.0.0) properties, shown here as a dictionary. The job is deployed in yarn-client mode.

configuration = {'spark.executor.memory': '4g',
                 'spark.executor.instances': '32',
                 'spark.driver.memory': '12g',
                 'spark.yarn.queue': 'default'}
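
For reference, here is a minimal sketch of how a dictionary like this is typically applied when building the session (the app name is illustrative, and with YARN the client deploy mode is normally chosen at submit time rather than in code):

    from pyspark.sql import SparkSession

    configuration = {'spark.executor.memory': '4g',
                     'spark.executor.instances': '32',
                     'spark.driver.memory': '12g',
                     'spark.yarn.queue': 'default'}

    # Apply each property to the builder; for a yarn-client deployment,
    # --master yarn --deploy-mode client would be passed to spark-submit.
    builder = SparkSession.builder.appName('hbase_extract')
    for key, value in configuration.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()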

After I obtain the DataFrame, I try to save it as:

df.write.save('user//hdfs//test_df', format='com.databricks.spark.csv', mode='append')
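
As a side note, Spark 2.x also ships a built-in CSV data source (this is what the comments below suggest), so a roughly equivalent call, assuming `df` is the same DataFrame, would be:

    # Built-in CSV writer in Spark 2.x; path kept exactly as in the call above.
    df.write.csv('user//hdfs//test_df', mode='append')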

The following error block keeps repeating until the job fails. I believe it might be an OOM error, but I have tried with as many as 128 executors, each with 16 GB of memory, to no avail. Any workaround would be greatly appreciated.

Container exited with a non-zero exit code 50

17/09/25 15:19:35 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 64, fslhdppdata2611.imfs.micron.com): ExecutorLostFailure (executor 42 exited caused by one of the running tasks) Reason: Container marked as failed: container_e37_1502313369058_6420779_01_000043 on host: fslhdppdata2611.imfs.micron.com. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e37_1502313369058_6420779_01_000043
Exit code: 50
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:109)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:89)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:392)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : run as user is hdfsprod
main : requested yarn user is hdfsprod
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /opt/hadoop/data/03/hadoop/yarn/local/nmPrivate/application_1502313369058_6420779/container_e37_1502313369058_6420779_01_000043/container_e37_1502313369058_6420779_01_000043.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
  • As of Spark 2.x, CSV is a top-level format, so you can use `df.write.csv` instead of the Databricks CSV format. Start there. Do you know how large the inputs and outputs are, and how many partitions there are? I would also look at the stages to see if there's skew in execution time, potentially repartition/coalesce, and even enable Spark speculation. A lot can be gleaned from the Spark UI. – Garren S Sep 26 '17 at 19:00
  • @Garren, I tried changing to `df.write.csv`, but no difference. I observed that even if I do `df.count()`, the same error pops up, so I am unable to tell you how large the dataframes are. – Kartik Bagalore Sep 26 '17 at 19:23
  • It's _good_ that the same error cropped up even with a count and not just while writing, which means we now know it's the lineage execution that's causing the trouble. Here's one approach (a sketch follows these comments): do a count in the code at each transformation to narrow down the exact location in your code causing trouble. Start closest to your input to find just how early on in your process the problem crops up. Without the code of what you're doing, it's hard to lead you in the right direction. – Garren S Sep 26 '17 at 19:32
  • @Garren Thanks for your time and effort. Well, this `df.count()` step is the very first step I execute after I run an HBase scan on the HDFS tables. So, apart from a basic scan through the HDFS tables, nothing else is executed prior to this. Also, if I do a `df.show()`, I get the expected output. It's only for operations such as `count`, `collect`, and `write` that I run into this error. I am looking up what exactly the lineage execution error you mentioned is. Thanks again. – Kartik Bagalore Sep 26 '17 at 19:55
  • Aha! If .show() works but fully materialized actions do not (e.g. count, write), then I suspect data/file corruption or a similar issue, because _some_ records are returned successfully. Lineage execution is just a fancy way of saying the problem lies somewhere in the process of actually doing something with the data, such as reading, transforming, etc. Based on your explanation, the root problem likely has nothing to do with Spark, but rather with the format/storage you're reading from. – Garren S Sep 26 '17 at 20:55
  • @Garren I am reading these files, which are in `.trv` format. Also, I tried the same set of operations on a different set of files (different date/time-related files), and it works perfectly: I was able to do `df.write.save`, `df.count`, and other operations. I am at a loss to understand how to proceed. – Kartik Bagalore Sep 27 '17 at 15:01
  • I'm not familiar with TRV files. If they are TSV, you may want to check for tab characters in the file that would cause offsets or file corruption. – Garren S Sep 27 '17 at 15:29
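
A minimal sketch of the incremental-count approach suggested in the comments above (the intermediate variable names and the `dropna` step are placeholders for whatever the real pipeline does):

    # df is assumed to be the DataFrame produced by the HBase scan in the question.
    print('rows after read:', df.count())           # if this fails, the scan/read itself is the problem

    df_step1 = df.dropna()                          # placeholder for the first real transformation
    print('rows after step 1:', df_step1.count())   # if this fails, step 1 introduced the problem

    # Repeat after each transformation until the failing step is located,
    # then write out once the lineage up to that point is known to be healthy.
    df_step1.write.csv('user//hdfs//test_df', mode='append')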

1 Answer


The exit code seems to come from org.apache.spark.util.SparkExitCode (based on this answer). Accordingly, exit code 50 should mean UNCAUGHT_EXCEPTION.

akki