
I have two datasets named dataset1 and dataset2. dataset1 looks like this:

empid  empname
101    john
102    kevin

and dataset2 looks like this:

empid  empmarks  empaddress
101      75        LA
102      69        NY
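
For context, here is a minimal sketch of how these two datasets could be built as Spark DataFrames, assuming Scala and a local SparkSession (the column names are taken from the tables above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Small lookup-style dataset: employee id -> name
val dataset1 = Seq(
  (101, "john"),
  (102, "kevin")
).toDF("empid", "empname")

// Large dataset: employee id -> marks and address
val dataset2 = Seq(
  (101, 75, "LA"),
  (102, 69, "NY")
).toDF("empid", "empmarks", "empaddress")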

dataset2 will be very large, and I need to run some operations across these two datasets and get results from them. As far as I know, I have two options for processing these datasets:

1. Store dataset1 (which is smaller) as a Hive lookup table and process the datasets through Spark.

2. Use Spark broadcast variables to process these datasets.

Can anyone suggest which is the better option?


1 Answer


There is a better option than the two options you mentioned.

Since the datasets share a common key, you can do an inner join:

dataset2.join(dataset1, Seq("empid"), "inner").show()
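
Given the sample data above, the output of show() should look roughly like this (the left side's columns come first, and the join key appears only once because it is passed as a Seq):

+-----+--------+----------+-------+
|empid|empmarks|empaddress|empname|
+-----+--------+----------+-------+
|  101|      75|        LA|   john|
|  102|      69|        NY|  kevin|
+-----+--------+----------+-------+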

You can also use the broadcast function/hint, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:

import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
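
To verify that the broadcast hint is actually applied, you can inspect the physical plan; a BroadcastHashJoin should appear in the output:

// Print the query plan instead of executing the job
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").explain()

Note that Spark will also broadcast the smaller side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit hint mainly matters when the size estimate is off or the threshold has been changed.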

Also look at the Spark SQL documentation on join hints for more details.
