I'm trying to join a large dataframe to a smaller one, and I read that a broadcast join is an efficient way to do that, according to this post.

However, I couldn't find a broadcast function in the SparkR documentation.

So I'm wondering: can you do a broadcast join with SparkR?

iehrlich

1 Answer

Spark 2.3: A broadcast function will be available; it is added in this pull request: https://github.com/apache/spark/pull/17965/files

Spark 2.2:

You can provide a custom hint to the query:

head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
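The one-liner above assumes that df and avg_mpg already exist. A minimal, self-contained sketch follows, assuming a local Spark 2.2+ installation and using R's built-in mtcars dataset as stand-in data. The name joined is introduced here only for illustration.

```r
library(SparkR)

# Assumption: a local Spark installation is available to SparkR.
sparkR.session(master = "local[*]")

# Large side of the join (mtcars is only a stand-in for real data).
df <- createDataFrame(mtcars)

# Small side: average mpg per cylinder count. A separate createDataFrame()
# call is used so the two sides of the join do not share column references.
avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

# hint(..., "broadcast") asks the optimizer to broadcast avg_mpg.
joined <- join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl)

# Print the physical plan to check which join strategy was chosen.
explain(joined)
head(joined)
```

If the hint is honored, the plan printed by explain() should mention a broadcast join rather than a sort-merge join.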

Reference (source code): https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L3740

The broadcast function in the Java, Scala and Python APIs is also just a wrapper that adds the broadcast hint. The hint gives the optimizer additional information: the user guarantees that this DataFrame is small, so it should be broadcast before being joined with other DataFrames.

Side note: Spark sometimes performs a broadcast join automatically. You can control automatic broadcast joins by setting:
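For comparison, here is a sketch of the same join written with the SparkR broadcast function. This assumes Spark 2.3 or later, where that function is available in SparkR; on 2.2 the call will fail and hint() is the way to go. The data setup with mtcars is illustrative only.

```r
library(SparkR)

# Assumption: a local Spark 2.3+ installation is available to SparkR.
sparkR.session(master = "local[*]")

df <- createDataFrame(mtcars)
avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

# broadcast() marks avg_mpg as small enough to be used in a broadcast join,
# which has the same effect as hint(avg_mpg, "broadcast").
joined_bc <- join(df, broadcast(avg_mpg), df$cyl == avg_mpg$cyl)
head(joined_bc)
```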

spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

Here, -1 disables automatic broadcast joins, so no DataFrame will be broadcast automatically. You can read more about this topic here
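Note that spark.sql(...) is the Scala/Python form of the call. In SparkR, the equivalent would likely look like the sketch below, which uses sql() to change the setting at runtime and sparkR.conf() to read it back. The 10 MB value at the end is only an illustrative number, not a recommendation.

```r
library(SparkR)

# Assumption: a local Spark installation is available to SparkR.
sparkR.session(master = "local[*]")

# Disable automatic broadcast joins for the current session.
sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

# Read the setting back to confirm it was applied.
threshold <- sparkR.conf("spark.sql.autoBroadcastJoinThreshold")
print(threshold)

# To allow automatic broadcasts again, set a size limit in bytes
# (10485760 bytes = 10 MB, chosen here purely as an example).
sql("SET spark.sql.autoBroadcastJoinThreshold = 10485760")
```

The setting can also be supplied when the session is created, through the sparkConfig argument of sparkR.session().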

T. Gawęda