
I know there have been some questions about Spark's temporary files, like this one, but I can't find one that answers mine.

I am using Spark 1.6.0 in standalone mode on Windows. Setting SPARK_LOCAL_DIRS on each worker tells Spark where to write its temporary files. Nonetheless, I get strange behavior with Snappy: whatever I try, each executor writes a copy of Snappy's DLL into my C:\Windows directory (which gets really polluted). The piece of code that's supposed to deal with temporary files in Spark is:

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  ...
  else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } ... (Mesos-specific handling)
  } else {
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}

I tried every combination of those settings, but I always end up with snappy-1.1.2-*-snappyjava.dll in C:\Windows (I think this is because that directory is the java.io.tmpdir).
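To make the lookup order in the quoted excerpt easier to follow, here is a minimal, self-contained sketch of just the three branches shown there. It is not Spark's code: `LocalDirsSketch`, `resolveLocalDirs`, and the two plain maps standing in for the environment and `SparkConf` are hypothetical, and the elided YARN and Mesos branches are left out on purpose.

```scala
import java.io.File

object LocalDirsSketch {
  // Hypothetical stand-in for the quoted branches only.
  // `env` plays the role of the process environment, `sparkConf` the role of SparkConf.
  def resolveLocalDirs(env: Map[String, String],
                       sparkConf: Map[String, String]): Array[String] =
    env.get("SPARK_EXECUTOR_DIRS").map(_.split(File.pathSeparator))
      // SPARK_LOCAL_DIRS is only consulted when SPARK_EXECUTOR_DIRS is unset.
      .orElse(env.get("SPARK_LOCAL_DIRS").map(_.split(",")))
      // Last resort: spark.local.dir, defaulting to the JVM's java.io.tmpdir.
      .getOrElse(
        sparkConf
          .getOrElse("spark.local.dir", System.getProperty("java.io.tmpdir"))
          .split(","))
}
```

In this sketch, java.io.tmpdir only appears as the final fallback for Spark's own local directories. Nothing in these branches decides where a third-party library such as Snappy unpacks its native DLL, which is consistent with the settings above having no effect on it.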

Does anyone know how to set the temporary directory where the executors write the DLLs? Thanks.

EDIT: It is indeed due to the java.io.tmpdir property, and I can change it with:

val opt = "-Djava.io.tmpdir=myPath"
conf.set("spark.executor.extraJavaOptions", opt)

Unfortunately, this sets the same path for every executor on every machine.

Vince.Bdn

1 Answer


So it looks like Spark 1.6.0 in standalone mode on Windows has no setting for choosing the directory into which Snappy's DLL is copied (or maybe this is intended, but that would be odd, since the files are not cleaned up afterwards). It always uses java.io.tmpdir as the directory for that copy, so to change it, the property has to be set on both sides: on the worker side, by launching the worker's JVM with the option -Djava.io.tmpdir=myPath, and on the driver side, by passing the same option to the JVM that launches the application.
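To check whether the option took effect in a given JVM, a small diagnostic along these lines may help. It is only a sketch: `TmpDirCheck` and its methods are hypothetical names, and the file-name pattern is an assumption based on the snappy-1.1.2-*-snappyjava.dll name quoted in the question.

```scala
import java.io.File

object TmpDirCheck {
  // Directory the current JVM reports as java.io.tmpdir.
  def jvmTmpDir: String = System.getProperty("java.io.tmpdir")

  // Names of files in `dir` that look like extracted snappy-java native libraries.
  // listFiles() returns null for a missing or unreadable directory, hence the Option.
  def snappyLibs(dir: File): Seq[String] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      .map(_.getName)
      .filter(n => n.startsWith("snappy-") && n.contains("snappyjava"))

  def main(args: Array[String]): Unit = {
    println(s"java.io.tmpdir = $jvmTmpDir")
    snappyLibs(new File(jvmTmpDir)).foreach(n => println(s"found: $n"))
  }
}
```

Running it with and without -Djava.io.tmpdir=myPath on the command line should show whether the JVM picked up the new directory and whether any Snappy DLLs have been written there.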

Vince.Bdn
    Depending on how you deploy, it may be you need to use `--driver-java-options` in this way `--driver-java-options -Djava.io.tmpdir=myPath`. That ought to be all for yarn-client mode. Yarn itself will govern the nodemanager's temp dirs. – vpipkt Apr 12 '17 at 18:22