
Environment

  1. EMR version: emr-5.30.0
  2. Spark version: 2.4.5
  3. Hadoop version: 2.8.5

I did the following steps to point the Spark event logs to S3:

STEP 1:

sudo nano /etc/spark/conf/spark-defaults.conf

Change the following properties

FROM:

spark.eventLog.dir               hdfs:///var/log/spark/apps
spark.history.fs.logDirectory    hdfs:///var/log/spark/apps

TO:

spark.eventLog.dir               s3a://com-tekioncloud-ml-dwh-tst/var/log/spark/apps
spark.history.fs.logDirectory    s3a://com-tekioncloud-ml-dwh-tst/var/log/spark/apps

Append the following properties at the end of the file:

spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.fast.upload   true
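
(For anyone provisioning a new cluster instead of editing the file by hand, I believe the same properties can be supplied at cluster-creation time through an EMR configuration classification; a minimal sketch, assuming the AWS CLI and my bucket name, with other cluster options omitted:)

cat > spark-s3-logs.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.dir": "s3a://com-tekioncloud-ml-dwh-tst/var/log/spark/apps",
      "spark.history.fs.logDirectory": "s3a://com-tekioncloud-ml-dwh-tst/var/log/spark/apps",
      "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "spark.hadoop.fs.s3a.fast.upload": "true"
    }
  }
]
EOF
# Pass the file when creating the cluster:
aws emr create-cluster --release-label emr-5.30.0 --configurations file://spark-s3-logs.json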

STEP 2: (I did this step because starting the Spark history server was failing with a "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" error)

So I followed what was suggested in a Stack Overflow answer:

sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/

These commands copy the following three files:

/usr/share/aws/aws-java-sdk/aws-java-sdk-core-1.11.759.jar
/usr/share/aws/aws-java-sdk/aws-java-sdk-s3-1.11.759.jar
/usr/lib/hadoop/hadoop-aws.jar
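
(As a quick sanity check, my own addition here: the S3A path can be verified from the master node before restarting anything. It should list the bucket prefix without a ClassNotFoundException; event logs will appear under it once an application runs.)

hadoop fs -ls s3a://com-tekioncloud-ml-dwh-tst/var/log/spark/apps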

STEP 3:

Restart the Spark history server.
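
(For completeness, on emr-5.30.0, which runs Amazon Linux 2 with systemd, I believe the restart is:)

sudo systemctl restart spark-history-server
# On older AL1-based EMR releases it would instead be:
# sudo stop spark-history-server && sudo start spark-history-server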

After that, everything was working fine: Spark jobs ran and the PySpark shell worked. But when I ran one streaming job, I got the following error:

2023-06-13 08:37:34,789 INFO fetchConfigInfo at 76: Start fetch connector function
2023-06-13 08:37:34,815 INFO fetchConfigInfo at 78: completed fetch connector function
2023-06-13 08:37:35,360 INFO deserialize_df at 41: columns received from topic : ['key', 'value', 'topic', 'partition', 'offset', 'timestamp', 'timestampType']
2023-06-13 08:37:36,147 INFO run at 2188: Callback Server Starting
2023-06-13 08:37:36,148 INFO run at 2192: Socket listening on ('127.0.0.1', 38511)
2023-06-13 08:37:38,012 INFO run at 2309: Callback Connection ready to receive messages
2023-06-13 08:37:38,012 INFO run at 2325: Received command c on object id p0
2023-06-13 08:37:38,012 INFO write_dataframe at 129: in write dataframe function
23/06/13 08:37:44 ERROR TransportClient: Failed to send RPC RPC 9032191979682419760 to /192.168.133.133:37466: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:958)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:716)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:708)
    at io.netty.channel.AbstractChannelHandlerContext.access$1700(AbstractChannelHandlerContext.java:56)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1102)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1149)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1073)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)
23/06/13 08:37:44 ERROR YarnScheduler: Lost executor 1 on ip-192-168-133-133.us-west-1.compute.internal: Container from a bad node: container_1686639124522_0021_01_000002 on host: ip-192-168-133-133.us-west-1.compute.internal. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1686639124522_0021_01_000002
Exit code: 134
Exception message: /bin/bash: line 1: 18318 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /etc/alternatives/jre/bin/java -server -Xmx5120m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=36697' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-192-168-177-14.us-west-1.compute.internal:36697 --executor-id 1 --hostname ip-192-168-133-133.us-west-1.compute.internal --cores 3 --app-id application_1686639124522_0021 --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/__app__.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.apache.spark_spark-sql-kafka-0-10_2.11-2.4.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/io.delta_delta-core_2.11-0.6.1.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.apache.kafka_kafka-clients-2.0.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.spark-project.spark_unused-1.0.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.lz4_lz4-java-1.4.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.xerial.snappy_snappy-java-1.1.7.1.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.slf4j_slf4j-api-1.7.16.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr4-4.7.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr4-runtime-4.7.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr-runtime-3.5.2.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_ST4-4.0.8.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.glassfish_javax.json-1.0.4.jar --user-class-path 
file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/com.ibm.icu_icu4j-58.2.jar > /var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002/stdout 2> /var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 18318 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /etc/alternatives/jre/bin/java -server -Xmx5120m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=36697' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-192-168-177-14.us-west-1.compute.internal:36697 --executor-id 1 --hostname ip-192-168-133-133.us-west-1.compute.internal --cores 3 --app-id application_1686639124522_0021 --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/__app__.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.apache.spark_spark-sql-kafka-0-10_2.11-2.4.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/io.delta_delta-core_2.11-0.6.1.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.apache.kafka_kafka-clients-2.0.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.spark-project.spark_unused-1.0.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.lz4_lz4-java-1.4.0.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.xerial.snappy_snappy-java-1.1.7.1.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.slf4j_slf4j-api-1.7.16.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr4-4.7.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr4-runtime-4.7.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_antlr-runtime-3.5.2.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.antlr_ST4-4.0.8.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar --user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/org.glassfish_javax.json-1.0.4.jar 
--user-class-path file:/mnt/yarn/usercache/ec2-user/appcache/application_1686639124522_0021/container_1686639124522_0021_01_000002/com.ibm.icu_icu4j-58.2.jar > /var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002/stdout 2> /var/log/hadoop-yarn/containers/application_1686639124522_0021/container_1686639124522_0021_01_000002/stderr

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Container exited with a non-zero exit code 134

But when I reverted all the changes, everything worked fine again.
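
(For context, exit code 134 is 128 + 6, i.e. the executor JVM was killed by SIGABRT. The full container logs can be pulled for more detail, assuming YARN log aggregation is enabled:)

yarn logs -applicationId application_1686639124522_0021 > app_0021.log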

Comments:
  • `ERROR YarnScheduler: Lost executor 1 on ip-192-168-133-133.us-west-1.compute.internal: Container from a bad node: container_1686639124522_0021_01_000002 on host: ip-192-168-133-133.us-west-1.compute.internal. Exit status: 134` Is the node `ip-192-168-133-133.us-west-1.compute.internal` healthy? Did you test it multiple times to confirm whether the failure is actually related to changing eventLog.dir to S3? – Sajjan Bhattarai Jun 14 '23 at 12:09
  • Yes, all the nodes were healthy. When I simply reversed the changes, everything ran smoothly. – Ritik Kaushik Jul 18 '23 at 13:22

0 Answers