Is there any way (or any plans) to be able to turn Spark distributed collections (RDD
s, Dataframe
or Dataset
s) directly into Broadcast
variables without the need for a collect
? The public API doesn't seem to have anything "out of box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for these kind of operations. To explain what I mean in detail let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
sc.broadcast(myStringPairRdd.collect().toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
This causes all the data to be collected to the driver, then the data is broadcasted. This means the data is sent over the network essentially twice.
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
BONUS
Furthermore, there could be a Monoid-like API (a bit like combineByKey
) for situations where the .toMap
or whatever operation on Array[T]
is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive, this kind of functionality could result in awesome scope for algorithm design. This CPU activity can also be run while the IO is running too - while the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
CLARIFICATION
Joining is not (main) use case here, it can be assumed that I sparsely use the broadcasted data structure. For example the keys in someOtherRdd
by no means covers the keys in myUberMap
but I don't know which keys I need until I traverse someOtherRdd
AND suppose I use myUberMap
multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.