
I have implemented the same function, dotproduct, once with a broadcast variable and once without:

shared = [1, 2, 3, 4, 5]
broadcasted = sc.broadcast(shared)

def dotproduct_shared(vector):
    return sum([v * w for v, w in zip(vector, shared)])

def dotproduct_broadcast(vector):
    return sum([v * w for v, w in zip(vector, broadcasted.value)])
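For reference, the logic both versions share is an ordinary dot product. A minimal plain-Python check of that zip-and-sum logic (no SparkContext needed; `weights` here stands in for either `shared` or `broadcasted.value`):

```python
# Plain-Python version of the zip-and-sum logic used by both functions.
def dotproduct(vector, weights):
    return sum(v * w for v, w in zip(vector, weights))

weights = [1, 2, 3, 4, 5]
result = dotproduct([1, 1, 1, 1, 1], weights)
# 1*1 + 1*2 + 1*3 + 1*4 + 1*5 = 15
```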

They both work, so the question is: what is the difference?

Why should I use broadcast?

Alberto Bonsanto
Uri Goren
  • I believe @AlbertoBonsanto is right. It's a duplicate. – eliasah Feb 29 '16 at 13:25
  • It could be a dupe, guys, but the answer is wrong, I believe. – zero323 Feb 29 '16 at 14:24
  • @zero323, don't leave me hanging, what is the right answer? – Uri Goren Feb 29 '16 at 16:34
  • @UriGoren A fundamental difference is the lifetime. Broadcast variables are persistent and can be reused; variables passed in closures are ephemeral and have to be re-sent for each stage that uses them. – zero323 Feb 29 '16 at 16:41
  • This question is not a duplicate. The way a broadcast variable is shared differs between Spark (Scala) and PySpark (Python), because the underlying executors are thread-based vs process-based. – chhantyal Dec 20 '17 at 16:34
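To make zero323's point concrete, here is a back-of-the-envelope sketch in plain Python (this models the shipping cost, it is not Spark's actual wire protocol; the task count and the integer-id handle are made-up illustrations) comparing a lookup list serialized into every task's closure versus shipped once as a broadcast:

```python
import pickle

# Simplified cost model: one lookup list, 8 tasks.
data = list(range(10_000))
tasks = 8

# Closure capture: the whole list is serialized into every task.
per_task_payload = len(pickle.dumps(data))
closure_bytes = tasks * per_task_payload

# Broadcast: the list is shipped once, and each task's closure only
# carries a small handle (modeled here as an integer id).
handle_bytes = len(pickle.dumps(0))
broadcast_bytes = per_task_payload + tasks * handle_bytes

# For any non-trivial payload, broadcast_bytes is far smaller
# than closure_bytes.
```

On top of the shipping cost, the broadcast value stays cached on each executor and can be reused by later stages, whereas a closure variable is re-serialized every time the function is shipped.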

0 Answers