I know Spark can cache and persist per partition. If I want to create a cache per node to avoid network traffic, is that possible?
Like checking that all customer IDs processed are valid, sort of a referential integrity check.
Yes, you can cache data on each node by using broadcast variables. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
broadcastVar = sc.broadcast([1, 2, 3])
The value of the broadcast variable can be accessed through its value attribute, e.g. broadcastVar.value.
If you want to cache data per partition, use the cache() (or persist()) method. It stores the computed partitions of the RDD locally on the executors that hold them, and the driver (master) keeps track of where those cached partitions live so later actions can reuse them instead of recomputing.