I have an RDD that was generated using Spark. If I write this RDD out as CSV, I am provided with methods like saveAsTextFile(), which outputs the file to HDFS.

I want to write the file to my local file system so that my SSIS process can pick up the files and load them into the DB.

I am currently unable to use Sqoop.

Is this possible in Java, other than by writing shell scripts to do it?

If any clarification is needed, please let me know.

Kanav Sharma
  • Not sure about any Spark method to do this, but you can always open a FileOutputStream, iterate over the RDD, and write it to the file. The plain old Java way! – Pranav Maniar Jul 06 '15 at 07:03
  • What path are you using in the saveAsTextFile() method? Can you provide a code snippet? – Pranav Maniar Jul 06 '15 at 07:05
  • I have tried the following paths: "hdfs://hadoop/bigdata/", which saves the file to HDFS; and the absolute file path "/kanav/output/", which returns without error but also does not create any file. – Kanav Sharma Jul 06 '15 at 07:52
  • The absolute path should start with file:///, as shown in the answer below. – Pranav Maniar Jul 06 '15 at 07:56

1 Answer

saveAsTextFile is able to take local file system paths (e.g. file:///tmp/magic/...). However, if you're running on a distributed cluster, you most likely want to collect() the data back to the driver and then save it with standard file operations.
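
A minimal Java sketch of both approaches; the paths, app name, and input location below are placeholders, not anything from the question:

    import java.io.BufferedWriter;
    import java.io.FileWriter;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SaveRddLocally {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("save-rdd-locally"));

            // Hypothetical RDD of already-formatted CSV lines.
            JavaRDD<String> lines = sc.textFile("hdfs:///bigdata/input");

            // Option 1: a file:/// path. On a distributed cluster each
            // executor writes its partitions to its OWN local disk, so the
            // driver machine may end up with little more than _SUCCESS.
            lines.saveAsTextFile("file:///tmp/magic/output");

            // Option 2: collect() to the driver, then plain Java file I/O.
            // Only viable when the whole dataset fits in driver memory.
            try (BufferedWriter writer =
                    new BufferedWriter(new FileWriter("/tmp/output.csv"))) {
                for (String line : lines.collect()) {
                    writer.write(line);
                    writer.newLine();
                }
            }

            sc.stop();
        }
    }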

Holden
  • Okay, passing the parameter with "file:///" returns successfully with a _SUCCESS file, but no output files can be seen. I am running on a distributed cluster; however, my data is so large that calling collect() overwhelms the JVM. – Kanav Sharma Jul 06 '15 at 09:31
  • If your file is too big for one machine, it does not really make much sense to save it locally instead of on HDFS or another distributed file system. – abalcerek Jul 06 '15 at 11:17
  • It is not the file size but the file count that is large; my process is designed to handle around 400GB of data per hour. @holden I have, for now, managed to do this using FileSystem.copyToLocalFile() (sketched after these comments). I will check it for a day for reliability and then have more information. – Kanav Sharma Jul 06 '15 at 11:55
  • @holden Let me know if the approach I am on needs modification. – Kanav Sharma Jul 06 '15 at 12:04
  • If your data is too big for the driver, then you will need to either store the data on HDFS (or a similar distributed file system), or, if you still really want to store it on the driver, use toLocalIterator (but remember to cache the RDD beforehand; see the sketch below), which only needs as much memory as the largest partition. – Holden Jul 06 '15 at 18:26
  • This answer is missing the code to save this using standard file operations. – user239558 Sep 16 '15 at 12:10
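
For completeness, here is a rough sketch of the FileSystem.copyToLocalFile() workaround mentioned in the comments: write the RDD to HDFS first, then pull the whole output directory down to the local disk of the machine running the code. All paths and the class/method names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import org.apache.spark.api.java.JavaRDD;

    public class CopyOutputToLocal {
        // Saves to HDFS, then copies the output directory (all part-* files)
        // to the local file system of the machine running this code.
        public static void saveViaHdfs(JavaRDD<String> lines) throws Exception {
            lines.saveAsTextFile("hdfs:///kanav/output");

            // Uses the default FS from the cluster's core-site.xml.
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyToLocalFile(new Path("hdfs:///kanav/output"),
                               new Path("file:///data/local/kanav/output"));
        }
    }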
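
And a sketch of the toLocalIterator approach from Holden's comment: cache the RDD so its lineage is not recomputed for every partition, then stream the partitions through the driver one at a time, so driver memory only ever needs to hold the largest partition. The output path and class/method names are hypothetical:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;

    public class StreamRddToLocalFile {
        // Unlike collect(), this never materializes the whole RDD on the
        // driver; each partition is fetched and written in turn.
        public static void save(JavaRDD<String> lines) throws Exception {
            lines.cache(); // avoid recomputing upstream stages per partition

            try (BufferedWriter writer =
                    new BufferedWriter(new FileWriter("/data/local/output.csv"))) {
                Iterator<String> it = lines.toLocalIterator();
                while (it.hasNext()) {
                    writer.write(it.next());
                    writer.newLine();
                }
            }
        }
    }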