
I am working with Spark 1.5.0 on Amazon's EMR. I have multiple properties files that I need to use in my spark-submit program. I explored the --properties-file option, but it only lets you import properties from a single file. I need to read properties from a directory whose structure looks like this:

├── AddToCollection
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json
├── CreateCollectionSuccess
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json
├── FeedCardUnlike
│   ├── query
│   ├── root
│   ├── schema
│   └── schema.json

In standalone mode I can get away with this by specifying the location of the files on the local file system. But that doesn't work in cluster mode, where I'm submitting a JAR with the spark-submit command. How can I do this in Spark?

nish
  • Hi, just curious, how did you make it work in standalone mode? Do you specify the locations in your application rather than in a Spark properties file? – keypoint Oct 08 '15 at 22:45
  • @keypoint Hi. Basically my query files contain SQL queries, which I read as `String query = new String(Files.readAllBytes(Paths.get(configLocation + event_type + "/query" )));`. – nish Oct 08 '15 at 23:07
  • Thanks, I see. Then why don't you package these query files into your big JAR, so that in cluster mode the workers will also be able to read them when the JAR is distributed? Or maybe I'm still missing your question... – keypoint Oct 08 '15 at 23:14
  • @keypoint: I'm not very good with Java; I've only just started with it. Can you please tell me how I can do that? I have placed these files under `src/main/resources`, but that does not work. – nish Oct 08 '15 at 23:17
  • sure, I'll post an answer below – keypoint Oct 08 '15 at 23:51

2 Answers


This works on Spark 1.6.1 (I haven't tested earlier versions).

spark-submit supports the --files argument, which accepts a comma-separated list of "local" files to be submitted along with your JAR file to the driver.

spark-submit \
    --class com.acme.Main \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 2g \
    --executor-memory 1g \
    --driver-class-path "./conf" \
    --files "./conf/app.properties,./conf/log4j.properties" \
    ./lib/my-app-uber.jar \
    "$@"

In this example I have created an uber JAR that does not contain any properties files. When I deploy my application, the app.properties and log4j.properties files are placed in the local ./conf directory.

The source for SparkSubmitArguments states:

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.
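
Because the files land in each container's working directory, your application can open them with a plain relative path. Here is a minimal sketch, assuming the app.properties name from the command above (the PropsLoader class name and the some.key property are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class PropsLoader {
    // Reads a properties file that spark-submit --files has shipped
    // into the YARN container's working directory.
    public static Properties load(String fileName) throws IOException {
        InputStream in = new FileInputStream(fileName);
        try {
            Properties props = new Properties();
            props.load(in);
            return props;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Properties props = load("app.properties");
        System.out.println(props.getProperty("some.key"));
    }
}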

Brad

I think you can package these files into your JAR file, and that JAR will be submitted to the Spark cluster along with everything inside it.

To read these files, you can use java.util.Properties; see also these Java Properties file examples. A sketch is given below.
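
For instance, a minimal sketch of loading a resource bundled under src/main/resources from the classpath of the submitted JAR (the ConfigLoader class name and the conf/app.properties path are illustrative; a plain text file such as one of your query files can be read from the same kind of stream):

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ConfigLoader {
    // Loads a .properties resource that was packaged into the JAR,
    // e.g. a file placed under src/main/resources in a Maven project.
    // Note: paths passed to the ClassLoader variant of
    // getResourceAsStream must not start with a leading slash.
    public static Properties load(String resourcePath) throws IOException {
        InputStream in = ConfigLoader.class.getClassLoader()
                .getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + resourcePath);
        }
        try {
            Properties props = new Properties();
            props.load(in);
            return props;
        } finally {
            in.close();
        }
    }
}

Then, from anywhere in your application: Properties p = ConfigLoader.load("conf/app.properties");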

Hope it helps.

keypoint