This is my join:
df = df_small.join(df_big, 'id', 'leftanti')
It seems I can only broadcast the right-hand dataframe, but for my logic (a leftanti join) to work, df_small must be on the left side.
How do I broadcast a dataframe that is on the left side of the join?
Example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_small = spark.range(2)
df_big = spark.range(1, 5000000)
# df_small    df_big
# +---+       +-------+
# | id|       |     id|
# +---+       +-------+
# |  0|       |      1|
# |  1|       |      2|
# +---+       |    ...|
#             |4999999|
#             +-------+
df_small = F.broadcast(df_small)
df = df_small.join(df_big, 'id', 'leftanti')
df.show()
df.explain()
# +---+
# | id|
# +---+
# |  0|
# +---+
#
# == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- SortMergeJoin [id#197L], [id#199L], LeftAnti
#    :- Sort [id#197L ASC NULLS FIRST], false, 0
#    :  +- Exchange hashpartitioning(id#197L, 200), ENSURE_REQUIREMENTS, [id=#1406]
#    :     +- Range (0, 2, step=1, splits=2)
#    +- Sort [id#199L ASC NULLS FIRST], false, 0
#       +- Exchange hashpartitioning(id#199L, 200), ENSURE_REQUIREMENTS, [id=#1407]
#          +- Range (1, 5000000, step=1, splits=2)
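For comparison, a minimal sketch with the hint moved to the right-hand dataframe instead (not what I want, since df_big is the large one; the df_hint_right name is just for this sketch), to illustrate the asymmetry I am describing:

df_hint_right = df_small.join(F.broadcast(df_big), 'id', 'leftanti')
df_hint_right.explain()
# I would expect this plan to contain a BroadcastHashJoin (BuildRight)
# instead of the SortMergeJoin above, i.e. the hint only seems to be
# honored when it is placed on the right side of the join.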