
I have two datasets named dataset1 and dataset2. dataset1 looks like this:

empid  empname
101    john
102    kevin

and dataset2 looks like this:

empid  empmarks  empaddress
101      75        LA
102      69        NY
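
For context, here is a minimal sketch of how these two datasets could be built as Spark DataFrames, assuming Scala and a local SparkSession (the column names are taken from the tables above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Small lookup-style dataset: employee id -> name
val dataset1 = Seq(
  (101, "john"),
  (102, "kevin")
).toDF("empid", "empname")

// Large dataset: employee id -> marks and address
val dataset2 = Seq(
  (101, 75, "LA"),
  (102, 69, "NY")
).toDF("empid", "empmarks", "empaddress")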

dataset2 will be very large, and I need to run some operations across these two datasets and get results from them. As far as I know, I have two options for processing these datasets:

1. Store dataset1 (which is smaller) as a Hive lookup table and process the datasets through Spark.

2. Use Spark broadcast variables to process these datasets.

Can anyone suggest which is the better option?


1 Answer


There is a better option than the two options you mentioned.

Since the datasets share a common key, you can do an inner join:

dataset2.join(dataset1, Seq("empid"), "inner").show()
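
Given the sample data above, the output of show() should look roughly like this (the left side's columns come first, and the join key appears only once because it is passed as a Seq):

+-----+--------+----------+-------+
|empid|empmarks|empaddress|empname|
+-----+--------+----------+-------+
|  101|      75|        LA|   john|
|  102|      69|        NY|  kevin|
+-----+--------+----------+-------+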

You can also use the broadcast function/hint, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:

import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
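
To verify that the broadcast hint is actually applied, you can inspect the physical plan; a BroadcastHashJoin should appear in the output:

// Print the query plan instead of executing the job
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").explain()

Note that Spark will also broadcast the smaller side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit hint mainly matters when the size estimate is off or the threshold has been changed.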

Also look at the Spark SQL documentation on join hints for more details.
