I have a Spark program that reads a relatively big dataframe (~3.2 TB) with two columns, id and name, and another relatively small dataframe (~20k rows) with a single column, id.
What I'm trying to do is keep the id and name from the big dataframe only for rows whose id appears in the small dataframe.
I was wondering what an efficient way to do this would be, and why. A couple of options I had in mind:
- Broadcast join the 2 dataframes
- Collect the small dataframe to the driver as an array of ids, then filter the big dataframe with isin against that array
Are there any other options that I didn't mention here?
I'd appreciate it if someone could also explain why one solution is more efficient than the other.
Thanks in advance