2

I have used Spark to build a machine learning pipeline that takes a job XML file as input, in which users can specify the data, features, models, and their parameters. The point of the XML job file is that users can simply edit it to reconfigure the pipeline without recompiling the source code. However, the Spark job is typically packaged into an uber-JAR, and there seems to be no way to provide an additional XML input when the job is submitted to YARN.

I wonder if there are any solutions or alternatives?

3 Answers

1

I'd look into Spark-JobServer. You can use it to submit your job to a Spark cluster together with a configuration. You might have to adapt your XML to the JSON-like format used by the config, or maybe encapsulate it somehow.

Here's an example on how to submit a job + config:

curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}
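
On the job side, spark-jobserver hands that submitted configuration to your job as a Typesafe Config object. Below is a minimal sketch of the receiving end, loosely following the WordCountExample referenced above; the spark.jobserver package, trait, and validation class names are from the 2014-era API and may differ in newer releases:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

import scala.util.Try

object WordCountExample extends SparkJob {
  // Fail fast if the expected config key was not submitted with the job.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("input.string"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string config param"))

  // Read the submitted config instead of a file bundled into the jar.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
}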
maasg
  • You can also 'cache' RDDs using the `NamedRDD` feature, which is very useful to hold pre-computed models in memory and then just run queries on them. Don't forget to accept and close the question if you find it's answered. – maasg Jun 20 '14 at 17:00
0

You should place the XML file in the resources directory if you want it to be bundled with the JAR. This is a basic Java/Scala thing.

Suggest reading: Get a resource using getResource()
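
For example, here is a minimal sketch of loading a bundled XML file from the classpath; the resource name job-config.xml is only a placeholder for illustration:

import scala.xml.{Elem, XML}

object JobConfigLoader {
  // Loads an XML file that was bundled into the jar under src/main/resources.
  // Returns None if the resource is not on the classpath.
  def loadFromClasspath(resource: String = "/job-config.xml"): Option[Elem] =
    Option(getClass.getResourceAsStream(resource)).map(is => XML.load(is))
}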

To replace the xml in the jar without rebuilding the jar: How do I update one file in a jar without repackaging the whole jar?

samthebest
  • 30,803
  • 25
  • 102
  • 142
  • Of course, if the XML file is created before the Maven build, then it can certainly be included in the resources directory. But here I want to avoid rebuilding: as a user of the pipeline, I just want to use an XML file to specify the components to be used and their properties at runtime, and I don't want to rebuild the JAR every time I specify a new XML file. – Jerrysdevil Jun 20 '14 at 16:31
  • Fair point, I've updated my answer to address it. I think it's just a couple of lines of bash to update a JAR file. – samthebest Jun 21 '14 at 11:00
  • That is also a good option. So instead of dynamically specifying a file at runtime, I can just use a static file and wrap the submission in a bash script, so that every time before submitting the job, the script replaces the file in the JAR. Thanks. – Jerrysdevil Jun 22 '14 at 06:11
0

The final solution that I used to solve this problem is:

  1. Store the XML file in HDFS,

  2. Pass in the file location of the XML file,

  3. Use the inputStreamHDFS helper to read it directly from HDFS:

val hadoopConf = sc.hadoopConfiguration
val jobfileIn: Option[InputStream] = inputStreamHDFS(hadoopConf, filename)
if (jobfileIn.isDefined) {
  logger.info("Job file found in file system: " + filename)
  xml = Some(XML.load(jobfileIn.get))
}
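
Here inputStreamHDFS is not a Spark or Hadoop API but a small user-defined helper; a possible sketch of it, using the standard Hadoop FileSystem API:

import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Opens an InputStream for a file stored in HDFS, or returns None if it does not exist.
def inputStreamHDFS(hadoopConf: Configuration, filename: String): Option[InputStream] = {
  val fs = FileSystem.get(hadoopConf)
  val path = new Path(filename)
  if (fs.exists(path)) Some(fs.open(path)) else None
}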