3

Currently, inside a transformation, I read a file and build a HashMap, which I keep in a static field so that it can be reused.

For every record I need to check whether the HashMap contains the record's key and, if it does, get the corresponding value from the HashMap.

What is the best way to do this?

Should I broadcast this HashMap and use it inside the transformation? (HashMap or ConcurrentHashMap?)

Does broadcasting guarantee that the HashMap always contains its values?

Is there any scenario in which the HashMap becomes empty, so that we need to check for that as well (and load it again if it is empty)?

Update:

Basically, I need to use a HashMap as a lookup inside a transformation. What is the best way to do this: a broadcast variable or a static variable?

When I use a static variable, I do not get the correct value from the HashMap for a few records. The HashMap contains only 100 elements, but I am comparing it against 25 million records.

Shankar
  • **"Currently inside transformation I am reading one file and creating a HashMap and it is an Static field for re-using purpose"**. So, are you saying that you are trying to create a broadcast variable for every transformation? PS: I suggest you post a small code example so I can understand your problem better; as it stands it is a bit obscure. – Alberto Bonsanto Feb 15 '16 at 14:32

2 Answers

4

First of all, a broadcast variable is read-only. It cannot serve as a mutable global variable the way one would in classic programming (one thread, one computer, procedural code, etc.). You can reference it anywhere in your code (even inside maps), but you must never modify it.

As described in "Advantages of broadcast variables", they improve performance: because every node keeps a cached copy of the data, the same object does not have to be shipped repeatedly to every node.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

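For example:

    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(1000))
    # Ship the dictionary to the executors once, as a read-only broadcast variable.
    broadcast = sc.broadcast({"number": 1, "value": 4})

    rdd = rdd.map(lambda x: x + broadcast.value["value"])
    rdd.collect()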

As you can see, I read the value from the dictionary on every iteration of the transformation.
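For the lookup case in the question (check whether a record's key is present and, if so, fetch the value), the same pattern might look like the sketch below. It assumes records are `(key, payload)` tuples and that a Python dict stands in for the HashMap; `enrich` and `run_with_spark` are hypothetical names, not part of Spark. PySpark is imported lazily so that the lookup logic itself can be exercised without a cluster.

```python
def enrich(record, lookup):
    """Return (key, payload, value); value is None when the key is not in the lookup."""
    key, payload = record
    # dict.get avoids a KeyError for records whose key is not in the table.
    return (key, payload, lookup.get(key))


def run_with_spark(records, table):
    """Broadcast `table` once and apply `enrich` to every record."""
    # Imported here (an assumption for this sketch) so enrich() works without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lookup_bc = sc.broadcast(table)  # read-only copy cached on each executor
    return (
        sc.parallelize(records)
        .map(lambda r: enrich(r, lookup_bc.value))
        .collect()
    )
```

With a table of about 100 entries, the broadcast is tiny compared with the 25 million records it is checked against, so records that have no match simply come back with `None` in the third position.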

Alberto Bonsanto
  • When I call Broadcast.value inside a transformation, will it return the same HashMap, or will it create a new HashMap for each row? – Shankar Feb 15 '16 at 14:00
  • I have updated my question with exactly what I am looking for. Please check it. – Shankar Feb 15 '16 at 14:19
0

You should broadcast the variable. A static field is not serialized and shipped with the closure, so each executor JVM has to initialize its own copy, and you cannot rely on it holding the same contents as it does on the driver.
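A minimal PySpark sketch of that advice, which also covers the "what if the map is empty" concern from the question: load the table once on the driver, fail fast if it is empty, and only then broadcast it. The two-column `key,value` CSV layout and the names `load_lookup` and `broadcast_lookup` are assumptions made for illustration; `sc` is an existing `SparkContext`.

```python
import csv


def load_lookup(path):
    """Read a two-column CSV of key,value pairs into a dict (hypothetical file layout)."""
    with open(path, newline="") as f:
        # Rows with fewer than two columns are skipped.
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}


def broadcast_lookup(sc, path):
    """Load the table on the driver, reject an empty table, then broadcast it."""
    table = load_lookup(path)
    if not table:
        # Checking here, before the job starts, replaces any per-record "reload if empty" logic.
        raise ValueError("lookup table is empty: " + path)
    return sc.broadcast(table)
```

Because the table is validated once on the driver and is read-only afterwards, a plain map is enough on the executors; nothing writes to it concurrently.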

Tomer