We are running a Spark Streaming job that repartitions its input into 20 partitions. I am reading data from a Kafka topic that has only 1 partition, so I use repartition to get more parallelism across the executors and also to control the rate of messages.

The UI always shows 1 skipped job. I tried changing the repartition number to 40, 15 and other values, but it still always shows 1 skipped job.

Here is the code snippet for the repartition:

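@Override
public void call(JavaRDD<ConsumerRecord<String, byte[]>> consumerStreamRdd) throws Exception {
    // Read the offsets before repartition(): only the RDD that comes straight from the direct stream implements HasOffsetRanges
    OffsetRange[] offsetRanges = ((HasOffsetRanges) consumerStreamRdd.rdd()).offsetRanges();
    JavaRDD<String> jsonRdd = consumerStreamRdd.repartition(20).map(new Function<ConsumerRecord<String, byte[]>, String>() {

        private static final long serialVersionUID = 1L;

        @Override
        public String call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
            // record-to-String conversion omitted in the original post
        }
    });
    // remainder of the method omitted in the original post
}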

I have following questions:

  1. Does it have any impact, e.g. data loss?

  2. How can I avoid these skipped jobs?

(Screenshot: Spark Skipped Jobs)

Here is the Spark configuration:

#!/bin/bash

export SPARK_MAJOR_VERSION=2

# Minimum TODOs on a per job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
# 2. remove properties not applicable to your Spark version (Spark 1.x vs. Spark 2.x)
# 3. tweak num_executors, executor_memory (+ overhead), and backpressure settings

# the two most important settings:
num_executors=4
executor_memory=16g

# 3-5 cores per executor is a good default balancing HDFS client throughput vs. JVM overhead
# see http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
executor_cores=2

# backpressure
receiver_min_rate=1
receiver_max_rate=10
receiver_initial_rate=10

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --master yarn --deploy-mode cluster \
  --name production \
  --class com.Data \
  --driver-memory 16g \
  --num-executors ${num_executors} --executor-cores ${executor_cores} --executor-memory ${executor_memory} \
  --files log4j-yarn-warid-br1-ccn-data.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-data.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-data.properties" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer `# Kryo Serializer is much faster than the default Java Serializer` \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.locality.wait=30 \
  --conf spark.task.maxFailures=8 `# Increase max task failures before failing job (Default: 4)` \
  --conf spark.ui.killEnabled=true `# Allow killing of stages and corresponding jobs from the Spark UI (set to false to prevent it)` \
  --conf spark.logConf=true `# Log Spark Configuration in driver log for troubleshooting` \
`# SPARK STREAMING CONFIGURATION` \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.default.parallelism=32 \
  --conf spark.streaming.blockInterval=75 `# [Optional] Tweak to balance data processing parallelism vs. task scheduling overhead (Default: 200ms)` \
  --conf spark.streaming.receiver.writeAheadLog.enable=true `# Prevent data loss on driver recovery` \
  --conf spark.streaming.backpressure.enabled=false \
  --conf spark.streaming.kafka.maxRatePerPartition=${receiver_max_rate} `# [Spark 1.x]: Corresponding max rate setting for Direct Kafka Streaming (Default: not set)` \
`# YARN CONFIGURATION` \
  --conf spark.yarn.driver.memoryOverhead=10240 `# [Optional] Set if --driver-memory < 5GB` \
  --conf spark.yarn.executor.memoryOverhead=10240 `# [Optional] Set if --executor-memory < 10GB` \
  --conf spark.yarn.maxAppAttempts=4 `# Increase max application master attempts (needs to be <= yarn.resourcemanager.am.max-attempts in YARN, which defaults to 2) (Default: yarn.resourcemanager.am.max-attempts)` \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h `# Attempt counter considers only the last hour (Default: (none))` \
  --conf spark.yarn.max.executor.failures=$((8 * ${num_executors})) `# Increase max executor failures (Default: max(numExecutors * 2, 3))` \
  --conf spark.yarn.executor.failuresValidityInterval=1h `# Executor failure counter considers only the last hour` \
  --conf spark.task.maxFailures=8 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20 -XX:MaxGCPauseMillis=800" \
  --conf spark.speculation=false \
/home/runscripts/production.jar
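
As a rough sanity check on the settings above, the following sketch works out how much parallelism and how many records per batch they imply. It assumes a 5-second batch interval (the post does not state one) and the default of one core per task. The class and method names are hypothetical and are not part of the job.

```java
public class StreamingSizing {

    /** Upper bound on records per batch for a direct Kafka stream. */
    static long maxRecordsPerBatch(long maxRatePerPartition, int kafkaPartitions, long batchIntervalSeconds) {
        // spark.streaming.kafka.maxRatePerPartition is records per second per Kafka partition
        return maxRatePerPartition * kafkaPartitions * batchIntervalSeconds;
    }

    /** Tasks that can run at the same time, assuming one core per task. */
    static int taskSlots(int numExecutors, int executorCores) {
        return numExecutors * executorCores;
    }

    /** Scheduling waves needed to run all tasks of one stage. */
    static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        long batchIntervalSeconds = 5; // ASSUMPTION: not given in the post

        long perBatch = maxRecordsPerBatch(10, 1, batchIntervalSeconds); // receiver_max_rate=10, 1 Kafka partition
        int slots = taskSlots(4, 2);                                     // num_executors=4, executor_cores=2
        int wavesNeeded = waves(20, slots);                              // repartition(20)

        System.out.println("max records per batch: " + perBatch);
        System.out.println("concurrent task slots: " + slots);
        System.out.println("waves for 20 partitions: " + wavesNeeded);
    }
}
```

Under these assumptions each batch holds at most 50 records, and the 20 post-repartition tasks run in 3 waves on 8 task slots, so a smaller repartition count may be enough at this rate.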
  • @eliasah I have added the code as well; hopefully it will help you answer the question. Waiting for an expert opinion here. – Imran Nov 09 '17 at 13:12
  • Thanks for the comment, @user8371915. How can I avoid this situation? And what is the impact of a skipped job? – Imran Nov 09 '17 at 13:36
