
I know Spark can cache and persist per partition. Is it possible to create a cache per node to avoid network traffic?

For example, to check that all customer ids being processed are valid, a sort of referential integrity check.

Ram Ghadiyaram
Jagib

2 Answers


Yes, you can cache data at each node by using broadcast variables. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

broadcastVar = sc.broadcast([1, 2, 3])

The value of the broadcast variable can be accessed by calling the value method.

SOURCE: SPARK PROGRAMMING GUIDE

None
  • Yes, that makes sense, but I was also hoping to cache only what is required by each node rather than the entire list. For instance, if I have 100 nodes, each node would have a cache based on the partition data on that node. – Jagib Feb 28 '17 at 22:38
  • There are also the cache and persist methods, though – OneCricketeer Feb 28 '17 at 23:48

If you want to cache data per partition, use the cache (or persist) method: it stores the partitions of the RDD it is called on locally, in memory on the nodes that computed them, and reports that RDD's storage information to the master node.

siddhartha jain