
I have implemented the same function, dotproduct, once with a broadcast variable and once without:

shared = [1, 2, 3, 4, 5]
broadcasted = sc.broadcast(shared)

def dotproduct_shared(vector):
    return sum([v * w for v, w in zip(vector, shared)])

def dotproduct_broadcast(vector):
    return sum([v * w for v, w in zip(vector, broadcasted.value)])
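For reference, the logic both versions share is an ordinary dot product. A minimal plain-Python check of that zip-and-sum logic (no SparkContext needed; `weights` here stands in for either `shared` or `broadcasted.value`):

```python
# Plain-Python version of the zip-and-sum logic used by both functions.
def dotproduct(vector, weights):
    return sum(v * w for v, w in zip(vector, weights))

weights = [1, 2, 3, 4, 5]
result = dotproduct([1, 1, 1, 1, 1], weights)
# 1*1 + 1*2 + 1*3 + 1*4 + 1*5 = 15
```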

They both work, so the question is: what is the difference?

Why should I use broadcast?

Alberto Bonsanto
Uri Goren
  • I believe @AlbertoBonsanto is right. It's a duplicate. – eliasah Feb 29 '16 at 13:25
  • It could be a dupe, guys, but the answer is wrong, I believe. – zero323 Feb 29 '16 at 14:24
  • @zero323, don't leave me hanging, what is the right answer? – Uri Goren Feb 29 '16 at 16:34
  • @UriGoren A fundamental difference is the lifetime. Broadcast variables are persistent and can be reused; variables passed in closures are ephemeral and have to be re-sent for each stage that uses them. – zero323 Feb 29 '16 at 16:41
  • This question is not a duplicate. The way a broadcast variable is shared differs between Spark (Scala) and PySpark (Python), because the underlying executors are thread-based vs process-based. – chhantyal Dec 20 '17 at 16:34
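To make zero323's point concrete, here is a back-of-the-envelope sketch in plain Python (this models the shipping cost, it is not Spark's actual wire protocol; the task count and the integer-id handle are made-up illustrations) comparing a lookup list serialized into every task's closure versus shipped once as a broadcast:

```python
import pickle

# Simplified cost model: one lookup list, 8 tasks.
data = list(range(10_000))
tasks = 8

# Closure capture: the whole list is serialized into every task.
per_task_payload = len(pickle.dumps(data))
closure_bytes = tasks * per_task_payload

# Broadcast: the list is shipped once, and each task's closure only
# carries a small handle (modeled here as an integer id).
handle_bytes = len(pickle.dumps(0))
broadcast_bytes = per_task_payload + tasks * handle_bytes

# For any non-trivial payload, broadcast_bytes is far smaller
# than closure_bytes.
```

On top of the shipping cost, the broadcast value stays cached on each executor and can be reused by later stages, whereas a closure variable is re-serialized every time the function is shipped.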

0 Answers