I have two tables, A and B, in SQL Server. Here's the schema:
Table A (
    col1 varchar(20) not null, -- FK to another table (NOT Table B)
    col2 varchar(20) not null, -- PK
    ...
    other columns
)
Table B (
    col1 varchar(20) not null,
    col2 varchar(20) not null,
    ...
    other columns
)
I create DataFrames from the two tables and join them.
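For context, here is roughly how the DataFrames are created (a sketch; loading over JDBC is assumed, and the connection details below are placeholders for the real SQL Server instance):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("TestEventJoins")
  .getOrCreate()

// Placeholder connection details; not the real host or credentials.
def loadTable(table: String): DataFrame =
  spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", table)
    .option("user", "<user>")
    .option("password", "<password>")
    .load()

val A = loadTable("A")
val B = loadTable("B")
```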
A.join(B, A("col1") === B("col1")) works fine
A.join(B, A("col2") === B("col2")) gives no rows
This behavior is consistent across runs. However, running the same join as a SQL query returns rows in both cases.
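For concreteness, the SQL check looks roughly like this (a sketch using temp views and Spark 2.x APIs; on 1.x it would be registerTempTable and sqlContext.sql):

```scala
// Register the same DataFrames as temp views and express the join in SQL.
A.createOrReplaceTempView("A_view")
B.createOrReplaceTempView("B_view")

spark.sql(
  """SELECT *
    |FROM A_view a
    |JOIN B_view b
    |ON a.col2 = b.col2""".stripMargin
).show()  // the SQL form returns rows for both col1 and col2
```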
On examining the logs, I found that in the failure case, after the DAGScheduler finishes its job, the ClosureCleaner kicks in and messes things up. In the successful case, the ClosureCleaner does NOT kick in at this point.
Failure case:
INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: show at TestEventJoins.scala:65, took 12.504064 s
DEBUG org.apache.spark.util.ClosureCleaner - +++ Cleaning closure (org.apache.spark.sql.execution.SparkPlan$$anonfun$3) +++
Successful case:
INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: show at TestEventJoins.scala:65, took 13.523901 s
DEBUG org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection - code for createexternalrow.....
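For reference, these lines only appear at DEBUG level; verbosity for just the relevant classes can be raised like this (a sketch using the log4j 1.x API bundled with Spark):

```scala
import org.apache.log4j.{Level, Logger}

// Enable DEBUG output only for the classes quoted above,
// instead of turning the whole application up to DEBUG.
Logger.getLogger("org.apache.spark.util.ClosureCleaner").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen").setLevel(Level.DEBUG)
```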
Can anyone help me understand what's going on here and how I can make this work?
I'm afraid I cannot provide a verifiable example: I've tried to reproduce this with sample tables but couldn't, and I can't give you access to the tables in question :(.
If anyone can help me understand what makes the ClosureCleaner kick in (I've already gone through this) in this particular case, or has any hints about characteristics of the columns, or the data in them, that might be causing this...
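In case the problem is in the data itself, this is the kind of check I can run on the join column and report back on (a sketch with standard Spark SQL functions; the =!= operator is Spark 2.x syntax):

```scala
import org.apache.spark.sql.functions.{col, length, trim}

// Look for trailing padding or other invisible characters in the key:
// rows where the raw length differs from the trimmed length.
A.select(
    col("col2"),
    length(col("col2")).as("raw_len"),
    length(trim(col("col2"))).as("trimmed_len")
  )
  .where(col("raw_len") =!= col("trimmed_len"))
  .show(false)
```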