
I am looking for a way to copy a folder of resource-dependency files from HDFS to the local working directory of each Spark executor, using Java.

My first thought was the `--files FILES` option of spark-submit, but it does not appear to support folders with arbitrary nesting. So it seems I have to put this folder on a shared HDFS path and have each executor copy it to its working directory before running the job, but I have yet to find out how to do that correctly in Java code.
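
Something like the helper below is what I have in mind. This is only a rough sketch: `hdfs:///shared/config`, the destination folder name, and the `ConfigFetcher` class are placeholders I made up. The idea is that each executor JVM fetches the folder at most once, before the job logic needs the files:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: call at the start of each task (e.g. inside
// mapPartitions) so the folder is fetched at most once per executor JVM.
public final class ConfigFetcher {
    private static boolean fetched = false;

    public static synchronized void fetchOnce() throws IOException {
        if (fetched) {
            return;
        }
        // "hdfs:///shared/config" is a placeholder for the shared path
        Path src = new Path("hdfs:///shared/config");
        FileSystem fs = src.getFileSystem(new Configuration());
        // copyToLocalFile copies directories recursively; the relative
        // destination should resolve against the executor's working directory
        fs.copyToLocalFile(false, src, new Path("config"), true);
        fetched = true;
    }
}
```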

Alternatively, I could zip/gzip/archive this folder, put the archive on a shared HDFS path, and then extract it into the local working directory of each Spark executor.
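
For that variant, the extraction step could be plain `java.util.zip`. Again only a rough sketch: the `Unzipper` name is made up, and the archive would first have to be copied locally, e.g. as in the sketch above:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Rough sketch: extracts a zip into the executor's working directory,
// preserving the folder structure stored inside the archive.
public final class Unzipper {
    public static void unzipToWorkingDir(File archive) throws IOException {
        byte[] buf = new byte[8192];
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                File out = new File(".", entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (InputStream in = zip.getInputStream(entry);
                     FileOutputStream fos = new FileOutputStream(out)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        fos.write(buf, 0, n);
                    }
                }
            }
        }
    }
}
```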

Any help or code samples would be appreciated.

To clarify: this is a folder of config files that are part of the computation and should be co-located with the spark-submit main jar (e.g. database files that the jar's code uses when running a job; unfortunately I cannot change this dependency, as I am reusing existing code).

Regards, -Yuriy

  • Spark executors running over YARN will be moved to the node/rack with the data itself. That's the fundamental principle of Hadoop: move the computation to the data. – OneCricketeer Oct 01 '17 at 17:41
  • @cricket_007 I understand the concept of moving compute to data, but that is not the case here. The folder (files) I am referring to is not data per se in the traditional Hadoop sense; these are config files and are part of the compute (e.g. database files that the jar's code uses when running a job; unfortunately I cannot change this dependency, as I am reusing existing code). – YuGagarin Oct 01 '17 at 17:45
  • Okay, then the `--files` parameter is what you need. Ideally a gzipped folder. – OneCricketeer Oct 01 '17 at 18:10
  • @cricket_007 Does `--files` support copying folders with arbitrary nesting? I could not confirm that it does. – YuGagarin Oct 01 '17 at 18:12
  • It'll support archive files, not folders. I've not had much experience using it. Alternatively, you can upload the files to an NFS or HDFS shared location. – OneCricketeer Oct 01 '17 at 18:15
  • @cricket_007 If I pass an archive to `--files`, how would I extract its contents at run time? I cannot just copy the archive, as the code expects the resource dependencies to be there in a certain folder structure. Also, if I put it on HDFS, I need a way to copy it locally to each executor's working directory, as per my original question. Thanks! – YuGagarin Oct 01 '17 at 18:36
  • Well, similar to `--files` there is `--archives`, which will automatically extract the archive for you. https://stackoverflow.com/questions/41498365/upload-zip-file-using-archives-option-of-spark-submit-on-yarn – OneCricketeer Oct 01 '17 at 18:47
  • @cricket_007 Thanks for the great pointer! Will give it a try. – YuGagarin Oct 01 '17 at 18:53
  • @cricket_007 I feel like my archives are not being copied over and extracted by YARN. Where (which logs) can I check for that? – YuGagarin Oct 01 '17 at 21:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155725/discussion-between-yugagarin-and-cricket-007). – YuGagarin Oct 01 '17 at 21:38
  • You would check the YARN UI or the Spark History Server. I've never used Azure, so I don't know how you'd get there. – OneCricketeer Oct 01 '17 at 22:30
  • I did check the YARN UI, and no archive is being copied from what I can tell. I do see the local master jar being copied over to HDFS, but that is it. – YuGagarin Oct 01 '17 at 22:56
  • I guess I was wrong. The problem was placing `--archives` or `--files` in the wrong place, AFTER the .jar file name. It should come before the .jar name in the spark-submit command (see the sketch after this thread). – YuGagarin Oct 03 '17 at 16:14
  • Right. `--archives` is the parameter. – OneCricketeer Oct 03 '17 at 16:18
  • Is there a way to automatically add the lists from `--files` or `--archives` to the executor classpath? Is it the `spark.executor.extraClassPath` property? – YuGagarin Oct 03 '17 at 16:27
  • `spark.yarn.dist.{archives,files}`. https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties – OneCricketeer Oct 03 '17 at 19:47
  • @cricket_007 This does not add those lists to the executor classpath. It copies them to the executor working directory, which is not on the classpath by default, it seems. – YuGagarin Oct 04 '17 at 22:45
  • Classpath is a Java term. If you want JAR files, use `--packages`. Otherwise, I don't understand your issue. You can open files from the working directory of the executor. – OneCricketeer Oct 04 '17 at 23:34
  • @cricket_007 Thanks for all your answers! It is not my code; I am using a third-party library. It is a jar plus non-code dependencies, and the library jar looks for these dependencies via the classpath. So I have to place the dependencies on the classpath somewhere on the executor. – YuGagarin Oct 05 '17 at 07:15
  • A JAR is simply a zip archive. If you rename it to .zip and then follow this other post, it should be extracted. https://stackoverflow.com/questions/41498365/upload-zip-file-using-archives-option-of-spark-submit-on-yarn – OneCricketeer Oct 05 '17 at 20:26
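
For future readers, here is a rough Java sketch of what this thread converged on, written against the `SparkLauncher` API rather than a raw spark-submit command line. `app.jar`, `com.example.Main`, `hdfs:///shared/deps.zip`, and the `config` alias are placeholders I made up. On YARN, the fragment after `#` should control the directory name the archive is extracted under inside the container working directory, and a relative `spark.executor.extraClassPath` entry should resolve against that same working directory, which is what finally puts the extracted files on the executor classpath:

```java
import org.apache.spark.launcher.SparkLauncher;

// Rough equivalent of:
//   spark-submit --master yarn \
//       --archives hdfs:///shared/deps.zip#config \
//       --conf spark.executor.extraClassPath=./config \
//       --class com.example.Main app.jar
// Note that --archives must come BEFORE the application jar; anything
// after the jar is treated as an argument to the application itself.
public final class Submit {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setMaster("yarn")
                .setAppResource("app.jar")           // placeholder jar
                .setMainClass("com.example.Main")    // placeholder class
                // YARN extracts the archive into the container working
                // directory under the alias given after '#'
                .setConf("spark.yarn.dist.archives",
                        "hdfs:///shared/deps.zip#config")
                // a relative entry resolves against the container working
                // directory, so ./config lands on the executor classpath
                .setConf("spark.executor.extraClassPath", "./config")
                .launch();
        spark.waitFor();
    }
}
```

If the third-party library opened its files by relative path instead of through the classpath, the extracted `config` folder could also be read directly from the executor's working directory, e.g. via `new File("config")`.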
