
I have a dataframe df that looks something like this:

+----+
| Id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
+----+

What I want to do is use this Id column in a SQL query to pull matching rows from another table. The table I'm querying is massive, so reading the whole thing and then doing a join isn't performant.

I tried something like this:

spark.sql(f"SELECT * FROM mytable WHERE Id IN {df.Id}")

but that doesn't seem to work. Is there another way to do what I'm attempting here?

Josh

1 Answer


You can use a broadcast join. (Your attempt fails because df.Id is a Column object; interpolating it into the query string embeds its repr, not the actual values.) If you broadcast the small df dataframe, it is no problem that mytable is large, because Spark ships df to every executor instead of shuffling mytable. First register df as a temporary view so it can be referenced from SQL:

df.createOrReplaceTempView("df")  # make df addressable from SQL

result = spark.sql("""SELECT /*+ BROADCASTJOIN(df) */ mytable.*
                      FROM mytable JOIN df ON mytable.Id = df.Id""")
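
Equivalently, if you prefer the DataFrame API, here is a minimal sketch; it assumes mytable can be loaded as a dataframe with spark.table, and the name large_df is just for illustration:

from pyspark.sql.functions import broadcast

large_df = spark.table("mytable")  # mytable loaded as a dataframe

# broadcast() marks df as the small side of the join, so Spark
# ships it to every executor and never shuffles large_df
result = large_df.join(broadcast(df), on="Id")

Note that Spark also broadcasts automatically when the small side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit hint makes the intent unambiguous.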
werner