I have an RDD on a Spark cluster. On the client side I call collect(), create a Java stream from the collected data, and write a CSV file from that stream.
Calling collect() on the RDD brings all of the data into memory on the client side, which is what I am trying to avoid. Is there a way to read the RDD from the Spark cluster as a stream instead?
I have a requirement to keep the logic that creates the CSV on the client side and not move it to the Spark cluster.
I am using a Standalone cluster and the Java API.
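
For context, the client-side part currently looks roughly like the sketch below. It is a simplified illustration, not the real code: the class and method names (CsvFromCollected, writeCsv, toCsvLine, escape) are hypothetical, each row is assumed to be a List<String>, and a hard-coded list stands in for the result of rdd.collect() so the snippet runs without a cluster.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CsvFromCollected {

    // Quote a field if it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (needsQuotes) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join the escaped fields of one row into a single CSV line.
    static String toCsvLine(List<String> row) {
        return row.stream()
                .map(CsvFromCollected::escape)
                .collect(Collectors.joining(","));
    }

    // Consume the stream row by row and write one CSV line per row.
    static void writeCsv(Stream<List<String>> rows, Writer out) {
        rows.map(CsvFromCollected::toCsvLine).forEach(line -> {
            try {
                out.write(line);
                out.write("\n");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) {
        // Assumption: stand-in for `List<List<String>> collected = rdd.collect();`
        // In the real application this list holds the entire RDD in client memory.
        List<List<String>> collected = Arrays.asList(
                Arrays.asList("1", "alice"),
                Arrays.asList("2", "bob, jr."));

        StringWriter out = new StringWriter();
        writeCsv(collected.stream(), out);
        System.out.print(out);
    }
}
```

The CSV writer itself only needs a Stream of rows, so the memory problem comes entirely from materializing the full list before the stream is created.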