I'm running Spark on a node with 4 disks (/mnt1, /mnt2, /mnt3, /mnt4). I want to write my temporary output from executors to a local directory. Is there any way to uniformly assign each of these disks to executors, so that all disks are uniformly used ? Currently, all data it written to /mnt1 from "forEachParition" action.
Asked
Active
Viewed 858 times
1
-
Please include configuration and what you do in `foreachPartition` – Oct 25 '16 at 23:02
-
Nothing, I just need to write partition data locally after some conversion and then upload it to S3. Due to some legacy reasons, I can't use spark's saveAsTextFile etc. Currently, I'm specifying /tmp which is mounted on /mnt1 and was wondering if there is a concept of working directory per executor (balanced cross all disks e,g, executor1 -> /mnt1/workdirectory1, executor2 -> /mnt2/workdirectory2 etc.). Otherwise I will have to uniformly and randomly pick one disk in the code – user401445 Oct 25 '16 at 23:10
-
This doesn't directly answer your question, but can you write to HDFS instead of the local filesystem? HDFS is automatically configured on EMR to stripe across all of the /mnt* disks. – Jonathan Kelly Oct 26 '16 at 17:04
-
@JonathanKelly In my use case, I need to pick a specific location for writing data, I can not save the whole RDD using saveAsTextFile. Do you know how to do that with HDFS? For e.g. If I pick /tmp it will be mounted to /mnt . Best would be in case Yarn assigns worker directory and I could write to that, but I'm not sure if that is possible – user401445 Oct 26 '16 at 17:52
-
YARN container working directories are also striped across all of the /mnt* disks, so yes, you might be able to write directly to the executor's current working directory. I have not tried this though. – Jonathan Kelly Oct 26 '16 at 17:55
-
@JonathanKelly Thanks. Do you know of any way to find executor's current working directory? I was only able to find param 'yarn.nodemanager.local-dirs' which I guess gives all working directories, but I'm not sure how to access container's working directory – user401445 Oct 26 '16 at 18:08
-
Just get the JVM's current working directory (see http://stackoverflow.com/questions/4871051/getting-the-current-working-directory-in-java). In short, it's System.getProperty("user.dir") – Jonathan Kelly Oct 26 '16 at 18:09