I am facing an issue while running a Spark Java program that reads a file, does some manipulation, and then generates an output file at a given path.
Everything works fine when the master and slaves are on the same machine, i.e. in standalone cluster mode.
But the problem started when I deployed the same program on a multi-machine, multi-node cluster setup, i.e. the master is running at x.x.x.102 and the slave is running at x.x.x.104.
The master and the slave have exchanged SSH keys and are reachable from each other.
Initially the slave was not able to read the input file. I found out that I need to call sc.addFile() before sc.textFile(), and that solved the issue.
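For reference, the read side of my code now looks roughly like this (the app name, input file name, and path are placeholders, not my real ones):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("emi-job");            // app name is illustrative
JavaSparkContext sc = new JavaSparkContext(conf);

// ship the input file to every node, then read it
sc.addFile("file:///tmp/emi-input.txt");                           // placeholder input path
JavaRDD<String> lines = sc.textFile("file:///tmp/emi-input.txt");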
But now I see that the output is being generated on the slave machine, in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000.
In local cluster mode it works fine and generates the output file at /tmp/emi/part-00000.
I came to know that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
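From the API docs, my understanding is that it takes the name of a file that was registered with sc.addFile() and returns the local path of that file on the current node, something like the line below (the file name is just an example). I just do not see how that relates to where the output ends up.

import org.apache.spark.SparkFiles;

// resolves the node-local copy of a file distributed with sc.addFile()
String localPath = SparkFiles.get("emi-input.txt");   // example file name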
Till now I have been using:
DataFrame dataObj = ...
dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can anyone please let me know how to call SparkFiles.get()?
In short, how can I tell the slave to create the output file on the machine where the driver is running?
Please help.
Thanks a lot in advance.