Questions tagged [spark-checkpoint]

51 questions
9
votes
2 answers

What is the difference between spark checkpoint and local checkpoint?

What is the difference between spark checkpoint and local checkpoint? When making local checkpoint I see this in the spark UI: It shows that local checkpoint is saved on memory.
Shadowtrooper
  • 1,372
  • 15
  • 28
8
votes
2 answers

Spark Structure Streaming fail duo to checkpoint file not found

I am running spark structured streaming on a test env. It happens from time to time that the job fail duo to some checkpoint file is not found. One reason might be that the kafka topic has a very short retention time. But I've added…
7
votes
1 answer

Dataframe Checkpoint Example Pyspark

I read about checkpoint and it looks great for my needs but I couldn't find a good example of how to use it. My questions are: Should I specifiy the checkpoint dir? Is it possible to do it like this: df.checkpoint() Are there any optional params…
6
votes
0 answers

How to read a checkpoint Dataframe in Spark Scala

I am trying to test below program to take the checkpoint and read if from checkpoint location if in case application fails due to any reason like resource unavailability. When I kill the job and retrigger it again, execution restarts from beginning.…
NRC
  • 83
  • 1
  • 5
4
votes
0 answers

How to figure out Kafka startingOffsets and endingOffsets in a scheduled Spark batch job?

I am trying to read from a Kafka topic in my Spark batch job and publish to another topic. I am not using streaming because it does not fit our use case. According to the spark docs, the batch job starts reading from the earliest Kafka offsets by…
3
votes
2 answers

Iterative caching vs checkpointing in Spark

I have an iterative application running on Spark that I simplified to the following code: var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000)) var c: Long = Int.MaxValue var iteration: Int = 0 while (c > 0) { iteration += 1 …
w4bo
  • 855
  • 7
  • 14
3
votes
1 answer

Spark not able to find checkpointed data in HDFS after executor fails

I am sreaming data from Kafka as below: final JavaPairDStream transformedMessages = rtStream .mapToPair(record -> new Tuple2(record.key(), record.value())) …
Amanpreet Khurana
  • 549
  • 1
  • 5
  • 17
2
votes
1 answer

dataproc spark checkpoint best practices? what should I set the checkpoint dir too?

I am running a very long-running batch job. It generates a lot of OOM exceptions. To minimize this problem added checkpoints() Where should I set the checkpoint dir to? The location has to be accessible to all the executors. Currently, I am using a…
AEDWIP
  • 888
  • 2
  • 9
  • 22
2
votes
1 answer

How to store Spark Streaming Checkpoint Location into S3?

I am interested in a Spark Streaming app (Spark v2.3.2) that sources S3 parquet data and writes parquet data to S3. The app's data frame stream makes use of groupByKey() and flatMapGroupsWithState() to make use of the GroupState. Is it possible to…
2
votes
0 answers

Spark streaming throwing errror of checkpoint after >10 min of run

I am executing the Streaming job with SQS on EMR, however after 10 min of run it starts throwing the error in background (Application still runs though), causing a lot of noise in logs. 2019-12-09 04:00:00,391 ERROR [JobGenerator]…
Sachin
  • 3,424
  • 3
  • 21
  • 42
2
votes
1 answer

Spark streaming SQS with checkpoint enable

I have went through multiple sites like…
Sachin
  • 3,424
  • 3
  • 21
  • 42
2
votes
1 answer

checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an rdd as detailed in 'learning spark' book

In learning Spark, I read the following: In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this…
Chris Bedford
  • 2,560
  • 3
  • 28
  • 60
2
votes
0 answers

How do I use EMRFS for checkpointing with Structured Streaming?

I have been using S3 for checkpointing with Structured Streaming. However I am getting the FileNotFound Exception related to eventual consistency in S3. Below is what I currently have with S3 checkpointing. val msg =…
fledgling
  • 991
  • 4
  • 25
  • 48
2
votes
0 answers

Spark Checkpointing: Content, Recovery and Idempotency

I am trying to understand the content of a checkpoint and corresponding recovery; understanding the process of checkpointing is obviously the natural way of going about it and so I went over the following list: medium post SO Spark docs the very…
Sheel Pancholi
  • 621
  • 11
  • 25
2
votes
1 answer

reading from hive table and updating same table in pyspark - using checkpoint

I am using spark version 2.3 and trying to read hive table in spark as: from pyspark.sql import SparkSession from pyspark.sql.functions import * df = spark.table("emp.emptable") here I am adding a new column with current date from system to the…
vikrant rana
  • 4,509
  • 6
  • 32
  • 72
1
2 3 4