
I have an RDD in a Spark cluster. On the client side I call collect(), then create a Java stream from the collected data and write a CSV file from that stream.
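A minimal sketch of that flow (the element type and the naive comma-joining are just assumptions to illustrate):

```java
import java.io.PrintWriter;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public class CollectToCsv {
    // Current approach: collect() materializes the whole RDD on the client
    // before any CSV line is written.
    public static void writeCsv(JavaRDD<String[]> rdd, String path) throws Exception {
        List<String[]> all = rdd.collect();               // entire dataset in client memory
        try (PrintWriter out = new PrintWriter(path)) {
            all.stream()
               .map(fields -> String.join(",", fields))   // naive CSV line, no escaping
               .forEach(out::println);
        }
    }
}
```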

When I call collect() on the RDD, I bring all the data into memory on the client side, which is exactly what I am trying to avoid. Is there any way to get the RDD from the Spark cluster as a stream?

I have a requirement to keep the logic that creates the CSV on the client side rather than moving it to the Spark cluster.

I am using a Standalone cluster and the Java API.

Ingwar
(Premise: I have not downvoted this question) I think you should read your question 2-3 times and rephrase it; it's pretty hard to understand what you want to do. If I understood the problem correctly, you are aggregating a bunch of data (millions of objects), but it is too big to keep in memory, and now you would like to process the data in chunks. Is this right? How many nodes do you have? What is your setup? How do you partition your data? What exactly are you trying to do? Could you post a few lines of code? – Markon Nov 17 '15 at 15:52

1 Answer


I am no expert, but I think I see what you are asking. Please post some code to help us understand it better, if you can.

For now, there are operations that work on a per-partition basis, though I don't know if that fully gets you home. See toLocalIterator in the first answer to this question: Spark: Best practice for retrieving big data from RDD to local machine
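To illustrate, here is a rough sketch of how toLocalIterator() could keep the CSV writing on the client while only pulling one partition's worth of data at a time (the element type and the comma-joining are assumptions on my part):

```java
import java.io.PrintWriter;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;

public class IteratorToCsv {
    // toLocalIterator() fetches partitions one by one, so only a single
    // partition needs to fit in client memory while the CSV is written.
    public static void writeCsv(JavaRDD<String[]> rdd, String path) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            Iterator<String[]> rows = rdd.toLocalIterator();   // streams partitions to the client
            while (rows.hasNext()) {
                out.println(String.join(",", rows.next()));    // naive CSV line, no escaping
            }
        }
    }
}
```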

You can control the number of partitions (per node, I believe) with the second parameter to parallelize, "slices", but it isn't documented well. I'm pretty sure that if you search for "partition" in the Spark Programming Guide you'll get a good idea.

http://spark.apache.org/docs/latest/programming-guide.html
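For example, a quick sketch of setting the slice count when creating an RDD, and re-splitting an existing one (the master URL, data, and counts here are placeholders, not from the question):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionControl {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "partition-demo");
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // The second argument ("slices") sets how many partitions the data is split into;
        // smaller partitions mean toLocalIterator() holds less on the client at a time.
        JavaRDD<Integer> rdd = sc.parallelize(data, 4);
        System.out.println("partitions: " + rdd.partitions().size());

        // An existing RDD can also be re-split with repartition(n).
        JavaRDD<Integer> finer = rdd.repartition(8);
        System.out.println("partitions after repartition: " + finer.partitions().size());

        sc.stop();
    }
}
```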

JimLohse