I have two dataframes, df1 and ip2Country. df1 contains IP addresses, and I am trying to map each IP address to geolocation information such as longitude and latitude, which are columns in ip2Country.

I am running it as a spark-submit job, but the operations take a very long time even though df1 has fewer than 2500 rows.

My code:

import org.apache.spark.sql.functions._
import spark.implicits._

// Look up the geolocation of each source IP (sint)
val agg = df1.join(ip2Country, ip2Country("network_start_int") === df1("sint"), "inner")
  .select($"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)

// Repeat the lookup for each destination IP (dint)
val agg1 = agg.join(ip2Country, ip2Country("network_start_int") === agg("dint"), "inner")
  .select($"src_ip", $"scountry",
    $"scode", $"slong",
    $"slat", $"dst_ip",
    $"country_name".alias("dcountry"),
    $"iso_3".alias("dcode"),
    $"longitude".alias("dlong"),
    $"latitude".alias("dlat"), $"count")
  .filter($"dlong".isNotNull)

Is there any other way to join the two tables? Or am I doing it the wrong way?


1 Answer


If you have a big DataFrame that needs to be joined with a small one, broadcast joins are very effective. Read here: Broadcast Joins (aka Map-Side Joins)

import org.apache.spark.sql.functions.broadcast

bigdf.join(broadcast(smalldf))
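
Applied to the question's dataframes, a minimal sketch (assuming df1, at under 2500 rows, is the small side, and reusing the column names from the question):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Hint Spark to ship the small df1 to every executor, so the large
// ip2Country lookup table is never shuffled across the cluster.
val agg = broadcast(df1)
  .join(ip2Country, ip2Country("network_start_int") === df1("sint"), "inner")
  .select($"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)

The same hint works for the second join on dint. Note that Spark already broadcasts automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit hint mainly matters when the size statistics are missing or the threshold is too low.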