I'm trying to join two DataFrames: one has around 10 million records and the other about a third of that. Since the smaller DataFrame fits comfortably in executor memory, I perform a broadcast join and then write out the result:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{broadcast, size}
import spark.implicits._

val df = spark.read.parquet("/plablo/data/tweets10M")
  .select("id", "content", "lat", "lon", "date")

// Filter/clean the tweets and keep only rows with more than one token
val fullResult = FilterAndClean.performFilter(df, spark)
  .select("id", "final_tokens")
  .filter(size($"final_tokens") > 1)

// Broadcast the smaller DataFrame and join on id
val fullDFWithClean = df.join(broadcast(fullResult), "id")

fullDFWithClean
  .write
  .partitionBy("date")
  .mode(SaveMode.Overwrite)
  .parquet("/plablo/data/cleanTokensSpanish")
After a while, I get this error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:125)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FileSourceScanExec.consume(DataSourceScanExec.scala:141)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduceVectorized(DataSourceScanExec.scala:392)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduce(DataSourceScanExec.scala:315)
.....
There's this question that addresses the same issue. In the comments, it's mentioned that increasing spark.sql.broadcastTimeout
could fix the problem, but after setting a large value (5000 seconds) I still get the same error (although much later, of course).
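For reference, this is roughly how I raise that timeout (a sketch; spark.sql.broadcastTimeout is in seconds, and the same value can also be passed with --conf at submit time):

// Sketch: raising the broadcast timeout on the active SparkSession.
// spark.sql.broadcastTimeout is expressed in seconds; 5000 is the value I tried.
spark.conf.set("spark.sql.broadcastTimeout", "5000")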
The original data is partitioned by the date column. The function that returns fullResult performs a series of narrow transformations and filters the data, so I'm assuming the partitioning is preserved.
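As a sanity check on that assumption, this is roughly how the partition counts of the two DataFrames can be compared (a sketch; the printed numbers come from a single run):

// Sketch: verifying that the narrow transformations keep the same number of partitions.
println(s"df partitions:         ${df.rdd.getNumPartitions}")
println(s"fullResult partitions: ${fullResult.rdd.getNumPartitions}")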
The physical plan confirms that Spark will perform a BroadcastHashJoin:
*Project [id#11, content#8, lat#5, lon#6, date#150, final_tokens#339]
+- *BroadcastHashJoin [id#11], [id#363], Inner, BuildRight
   :- *Project [id#11, content#8, lat#5, lon#6, date#150]
   :  +- *Filter isnotnull(id#11)
   :     +- *FileScan parquet [lat#5,lon#6,content#8,id#11,date#150] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<lat:double,lon:double,content:string,id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
      +- *Project [id#363, UDF(UDF(UDF(content#360))) AS final_tokens#339]
         +- *Filter (((UDF(UDF(content#360)) = es) && (size(UDF(UDF(UDF(content#360)))) > 1)) && isnotnull(id#363))
            +- *FileScan parquet [content#360,id#363,date#502] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<content:string,id:int>
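(The plan above can be reproduced with something like the line below; explain(true) would also include the parsed, analyzed, and optimized logical plans.)

// Sketch: printing the physical plan of the joined DataFrame.
fullDFWithClean.explain()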
I believe that, given the size of my data, this operation should be relatively fast. I'm running on 4 executors with 5 cores and 4g of RAM each, on YARN in cluster mode.
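For completeness, here is a sketch of that resource configuration expressed as SparkSession settings (the app name is just a placeholder; in my actual job the equivalent values are passed to spark-submit as --num-executors 4 --executor-cores 5 --executor-memory 4g):

import org.apache.spark.sql.SparkSession

// Sketch of the cluster resources described above, set programmatically on YARN.
val spark = SparkSession.builder()
  .appName("cleanTokensSpanish")           // hypothetical app name
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "4g")
  .getOrCreate()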
Any help is appreciated.