I have two tables, A and B, in SQL Server. Here's the schema:
Table A (
    col1 varchar(20) not null, -- FK to another table (NOT Table B)
    col2 varchar(20) not null, -- PK
    ...
    other columns
)
Table B (
    col1 varchar(20) not null,
    col2 varchar(20) not null,
    ...
    other columns
)
I create DataFrames from the two tables and join them.
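For context, here is roughly how the DataFrames are created (a sketch; loading over JDBC is assumed, and the connection details below are placeholders for the real SQL Server instance):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("TestEventJoins")
  .getOrCreate()

// Placeholder connection details; not the real host or credentials.
def loadTable(table: String): DataFrame =
  spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", table)
    .option("user", "<user>")
    .option("password", "<password>")
    .load()

val A = loadTable("A")
val B = loadTable("B")
```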
A.join(B, A("col1") === B("col1")) works fine
A.join(B, A("col2") === B("col2")) gives no rows
This behavior is consistent across runs. However, running the same join as a SQL query returns rows in both cases.
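For concreteness, the SQL check looks roughly like this (a sketch using temp views and Spark 2.x APIs; on 1.x it would be registerTempTable and sqlContext.sql):

```scala
// Register the same DataFrames as temp views and express the join in SQL.
A.createOrReplaceTempView("A_view")
B.createOrReplaceTempView("B_view")

spark.sql(
  """SELECT *
    |FROM A_view a
    |JOIN B_view b
    |ON a.col2 = b.col2""".stripMargin
).show()  // the SQL form returns rows for both col1 and col2
```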
On examining the logs, I found that in the failure case, after the DAGScheduler finishes its job, the ClosureCleaner kicks in and messes things up. In the successful case, the ClosureCleaner does NOT kick in at this point.
Failure case:
INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: show at TestEventJoins.scala:65, took 12.504064 s
DEBUG org.apache.spark.util.ClosureCleaner - +++ Cleaning closure (org.apache.spark.sql.execution.SparkPlan$$anonfun$3) +++
Successful case:
INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: show at TestEventJoins.scala:65, took 13.523901 s
DEBUG org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection - code for createexternalrow.....
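For reference, these lines only appear at DEBUG level; verbosity for just the relevant classes can be raised like this (a sketch using the log4j 1.x API bundled with Spark):

```scala
import org.apache.log4j.{Level, Logger}

// Enable DEBUG output only for the classes quoted above,
// instead of turning the whole application up to DEBUG.
Logger.getLogger("org.apache.spark.util.ClosureCleaner").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen").setLevel(Level.DEBUG)
```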
Can anyone help me understand what's going on here and how I can make this work?
I'm afraid I cannot provide a verifiable example: I've tried to reproduce this with sample tables but couldn't, and I can't give you access to the tables in question :(.
If anyone can help me understand what makes the ClosureCleaner kick in (I've already gone through this) in this particular case, or has any hints about characteristics of the columns, or the data in them, that might be causing this...
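In case the problem is in the data itself, this is the kind of check I can run on the join column and report back on (a sketch with standard Spark SQL functions; the =!= operator is Spark 2.x syntax):

```scala
import org.apache.spark.sql.functions.{col, length, trim}

// Look for trailing padding or other invisible characters in the key:
// rows where the raw length differs from the trimmed length.
A.select(
    col("col2"),
    length(col("col2")).as("raw_len"),
    length(trim(col("col2"))).as("trimmed_len")
  )
  .where(col("raw_len") =!= col("trimmed_len"))
  .show(false)
```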