As a follow-up to my previous question, how do I map over an RDD locally, i.e., collect the data into a local stream without actually using collect
(because the data is far too large).
Specifically, I want to write something like
from subprocess import Popen, PIPE
with open('out','w') as out:
with open('err','w') as err:
myproc = Popen([.....],stdin=PIPE,stdout=out,stderr=err)
myrdd.iterate_locally(lambda x: myproc.stdin.write(x+'\n'))
How do I implement this iterate_locally
?
does NOT work:
collect
return value is far too large:myrdd.collect().foreach(lambda x: myproc.stdin.write(x+'\n'))
does NOT work:
foreach
executes its argument in a distributed mode, NOT locallymyrdd.foreach(lambda x: myproc.stdin.write(x+'\n'))
Related: