
I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my Spark job with spark-submit.

I am using the Typesafe Config library below to load my property file:

<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.1</version>
</dependency>

In my code I am using

public static Config loadEnvProperties(String environment) {
    Config appConf = ConfigFactory.load(); // loads "application.properties" from the "resources" folder
    return appConf.getConfig(environment);
}
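
(Note: ConfigFactory.load() already honors the config.file system property, so passing -Dconfig.file=... should make it read the external file instead of the bundled one. A minimal sketch that makes this fallback explicit, using only standard Typesafe Config calls (the wrapper class name EnvConfig is hypothetical), would be:)

import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class EnvConfig { // hypothetical wrapper class for this sketch
    public static Config loadEnvProperties(String environment) {
        String external = System.getProperty("config.file"); // set via -Dconfig.file=...
        Config appConf = (external != null)
                ? ConfigFactory.parseFile(new File(external)).resolve() // external file wins
                : ConfigFactory.load(); // falls back to application.properties on the classpath
        return appConf.getConfig(environment);
    }
}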

To externalize this application.properties file, I tried the following spark-submit command, as suggested by an expert:

spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor  \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
  migration-0.0.1.jar sit 

I placed the log4j.properties and applicationNew.properties files in the same folder from which I run spark-submit.

1) In the above shell script, if I use

--files /local/apps/log4j.properties,  /local/apps/applicationNew.properties \

Error:

Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
        at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)

So what is wrong here?

2) Then I changed the above script as shown, i.e.

  --files /local/apps/log4j.properties \
    --files /local/apps/applicationNew.properties \

When I run the Spark job, I get the following error:

19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
        at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)

So what is wrong here? Why is the applicationNew.properties file not loading?

3) When I debugged it by printing "config.file" as below:

String ss = System.getProperty("config.file");
logger.error("config.file : {}", ss);

Error:

19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'

So how do I set the "config.file" option from spark-submit?

How do I fix the above errors and load the properties from the external applicationNew.properties file?

  • try this: `--driver-java-options -Dconfig.file=./path/conf.file` – Lamanus Aug 02 '19 at 16:36
  • @Lamanus which version of Spark are you using? Mine is 2.4.1... nope, same error: 19/08/02 16:40:12 ERROR Driver: config.file : null 19/08/02 16:40:12 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit' – BdEngineer Aug 02 '19 at 16:42
  • 2.4.x worked, but mine has only one config file. --files did not work in my case. – Lamanus Aug 02 '19 at 16:44
  • @Lamanus I can remove the other --files to check whether it works; what about the executor Java options? – BdEngineer Aug 02 '19 at 16:52
  • I didn't, and I've never seen that kind of option yet. – Lamanus Aug 02 '19 at 17:01
  • @Lamanus I kept only one file, still no luck... Driver: config.file : null 19/08/02 19:03:20 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit' – BdEngineer Aug 02 '19 at 19:08
  • @Lamanus which option are you referring to? – BdEngineer Aug 02 '19 at 19:09

2 Answers


--files and SparkFiles.get

With --files you should access the resource using SparkFiles.get as follows:

$ ./bin/spark-shell --files README.md

scala> import org.apache.spark._
import org.apache.spark._

scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md

In other words, Spark will distribute the --files to the executors, but the only way to know the path of the files is to use the SparkFiles utility.
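
For a Java app like the one in the question, a minimal sketch of this approach (the class name and the applicationNew.properties file name are assumptions taken from the question) could combine SparkFiles.get with Typesafe Config:

import java.io.File;
import org.apache.spark.SparkFiles;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DistributedConfigLoader { // hypothetical class name
    // Resolves the --files-distributed properties file to its absolute
    // local path on this node and parses it with Typesafe Config.
    // Requires an active SparkContext.
    public static Config loadEnvProperties(String environment) {
        String path = SparkFiles.get("applicationNew.properties");
        return ConfigFactory.parseFile(new File(path)).resolve().getConfig(environment);
    }
}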

getResourceAsStream(resourceFile) and InputStream

The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:

this.getClass.getClassLoader.getResourceAsStream(resourceFile)

With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.

I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts an InputStream as a way to read resource files.
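
For instance, with Typesafe Config in Java, a sketch of the InputStream route (the loader class name is hypothetical) could look like this:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ClasspathConfigLoader { // hypothetical class name
    public static Config load(String resourceFile) {
        InputStream is = ClasspathConfigLoader.class.getClassLoader().getResourceAsStream(resourceFile);
        if (is == null) {
            throw new IllegalArgumentException("Not found on CLASSPATH: " + resourceFile);
        }
        // Typesafe Config parses character streams, so wrap the InputStream in a Reader
        return ConfigFactory.parseReader(new InputStreamReader(is, StandardCharsets.UTF_8)).resolve();
    }
}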


You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).

– Jacek Laskowski
  • thank you, but if I put the files in the jar every time, how would that be externalizing the property files? – BdEngineer Aug 05 '19 at 07:34
  • Every file of the Spark app is distributed by `spark-submit` automatically (that's one of the features of Spark), so you don't have to worry about it. The question is then how to access the files (inside or outside jar files) in code that expects them on a file system in a given location. That's what `SparkFiles.get` does. – Jacek Laskowski Aug 05 '19 at 09:22
  • I don't think that `SparkFiles.get` is the only way. From what I see when running jobs on various environments, if a file is distributed via the `--files` option, it is available in the current working directory of the job being run, so a relative path would work correctly. – Vladimir Matveev Aug 05 '19 at 23:58
  • @Shyam _"everytime I put the files in jar , how it would be externalizing the property files ?"_ It's only now that I understood your question. Spark does not re-package your jar files. They are available as is, but your question was how to access the resource files, and one way is to use `SparkFiles.get` or... see my updated answer :) – Jacek Laskowski Aug 06 '19 at 09:03

The proper way to list files for --files, --jars, and other similar arguments is as a single comma-separated list with no spaces (this is crucial, and you see the exception about an invalid main class precisely because of this):

--files /local/apps/log4j.properties,/local/apps/applicationNew.properties

If the file names themselves contain spaces, use quotes to escape them:

--files "/some/path with/spaces.properties,/another path with/spaces.properties"

Another issue is that you specify the same property twice:

...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...

There is no way for spark-submit to know how to merge these values, so only one of them is used. This is why you see null for the config.file system property: the second --conf argument simply takes priority and overwrites the extraJavaOptions property with just the path to the log4j config file. Thus, the correct way is to specify all these values in one property:

--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"

Note that because of the quotes, the entire spark.driver.extraJavaOptions="..." value is a single command-line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.

(I also changed the log4j.configuration value to a proper file: URI instead of a bare path. I recall that it might not work unless this path is a URI, but you can try either way and check for sure.)
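
Putting both fixes together (one comma-separated --files list, and one merged extraJavaOptions value per JVM), the command from the question might look roughly like the untested sketch below, keeping all paths and names as given:

spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties -Dlog4j.debug" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties" \
  migration-0.0.1.jar sit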

– Vladimir Matveev