
I have a dataframe df that looks something like this:

+----+
| Id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
+----+

What I want to do is use this Id column in a SQL query to pull matching rows from another table. The table I'm querying is massive, so reading the whole thing and then doing a join isn't performant.

I tried something like this:

spark.sql(f"SELECT * FROM mytable WHERE Id IN {df.Id}")

but that doesn't seem to work. Is there another way to do what I'm attempting here?

Josh

1 Answer


You can use a broadcast join. (Your attempt fails because df.Id is a Column object; interpolating it into the query string embeds its repr, not the actual values.) If you broadcast the small df dataframe, it is no problem that mytable is large, because Spark ships df to every executor instead of shuffling mytable. First register df as a temporary view so it can be referenced from SQL:

df.createOrReplaceTempView("df")  # make df addressable from SQL

result = spark.sql("""SELECT /*+ BROADCASTJOIN(df) */ mytable.*
                      FROM mytable JOIN df ON mytable.Id = df.Id""")
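
Equivalently, if you prefer the DataFrame API, here is a minimal sketch; it assumes mytable can be loaded as a dataframe with spark.table, and the name large_df is just for illustration:

from pyspark.sql.functions import broadcast

large_df = spark.table("mytable")  # mytable loaded as a dataframe

# broadcast() marks df as the small side of the join, so Spark
# ships it to every executor and never shuffles large_df
result = large_df.join(broadcast(df), on="Id")

Note that Spark also broadcasts automatically when the small side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit hint makes the intent unambiguous.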
werner