
I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same data. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there a configuration problem with Spark 3.0?

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0

scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349

Here are the details:

[screenshot of the Spark 3.0 DAG]

smishra
  • Maybe this [Spark Bug](https://issues.apache.org/jira/browse/SPARK-29170)? A workaround might be to reduce the samplingRatio when reading the Json files – werner Jun 28 '20 at 13:26
  • Maybe it's a bug, or maybe Spark 3.0 uses an unoptimized DAG - please notice the additional stages FileScanRDD and SQLExecutionRDD. I suppose these two are slowing down the processing. – smishra Jun 28 '20 at 14:54
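
A sketch of the samplingRatio workaround mentioned in the comment above - the 0.1 value is only illustrative and would need tuning against the actual data:

// Infer the schema from roughly 10% of the JSON records instead of all of them.
// samplingRatio only affects schema inference, not the data that gets loaded.
val df = spark.read
  .option("samplingRatio", "0.1")
  .json("/data/20200528")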

1 Answer


As it turns out, the default behavior of Spark 3.0 has changed - it tries to infer timestamps unless a schema is specified, and that results in a huge amount of text scanning. When I loaded the data with inferTimestamp=false, the time did come close to that of Spark 2.4, but Spark 2.4 still beats Spark 3 by ~3+ seconds (which may be in an acceptable range, but the question is why?). I have no idea why this behavior was changed, but it should have been announced in BOLD letters.

Spark 2.4

spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 29706 ms
res0: Long = 2605349



spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 31431 ms
res0: Long = 2605349

Spark 3.0

spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 32826 ms
res0: Long = 2605349
 
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 34011 ms
res0: Long = 2605349

Note:

  • Make sure you never set prefersDecimal to true, even when inferTimestamp is false - it again takes a huge amount of time.
  • Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec.
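
If the schema is known up front, inference can be skipped entirely. A minimal sketch - only created and id are visible in the output above, so the remaining fields are placeholders that would have to match the real data:

import org.apache.spark.sql.types._

// With an explicit schema, Spark does not scan the files to infer types at all.
val schema = StructType(Seq(
  StructField("created", LongType),
  StructField("id", StringType)
  // ... the other five fields of the real dataset go here
))

spark.time(spark.read.schema(schema).json("/data/20200528/").count)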
smishra
  • I also found my pyspark code (which applies some Pandas UDFs) is slower on Spark 3 (EMR 6.3.0) than Spark 2.4 (EMR 5.30.0). – panc Jul 14 '21 at 05:05