
Background:

  1. Our project is built on the Play Framework.
  2. Front-end language: JavaScript
  3. Back-end language: Scala
  4. We are developing a web application; the server is a cluster.

Want to achieve:

  1. In the web UI, the user first inputs some query parameters and clicks a button such as "Submit". These parameters are then sent to the back end. (This part is easy, obviously.)
  2. When the back end receives the parameters, it starts reading and processing the data stored in HDFS. The processing includes data cleaning, filtering, and other operations such as clustering algorithms; it is not just a Spark SQL query. All of these operations need to run on the Spark cluster (a minimal sketch of such a parameterized job follows this list).
  3. We should not have to manually pack a fat jar, submit it to the cluster, and send the result to the front end by hand (this is what is bothering me!).
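
To make item 2 concrete, here is a minimal sketch of such a parameterized job (assuming Spark 2.x with the spark-ml module on the classpath; the HDFS paths, the "features" column and the cleaning step are placeholders, not our real pipeline):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

object ClusteringJob {
  def main(args: Array[String]): Unit = {
    // Parameters supplied by the front end arrive as ordinary program arguments.
    val numClusters = args(0).toInt
    val inputPath   = args(1)   // e.g. an HDFS path to the raw spatial data
    val outputPath  = args(2)   // where the processed data should be written

    val spark = SparkSession.builder().appName("ClusteringJob").getOrCreate()

    // Stand-ins for the real cleaning/filtering steps; assumes the data already
    // has a "features" vector column that KMeans can consume.
    val data    = spark.read.parquet(inputPath)
    val cleaned = data.na.drop()

    val model = new KMeans().setK(numClusters).setFeaturesCol("features").fit(cleaned)
    model.transform(cleaned).write.mode("overwrite").parquet(outputPath)

    spark.stop()
  }
}
```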

What we have done:

  • We built a separate Spark project in IDEA. When we get the parameters, we manually assign them to variables in that Spark project.
  • Then "Build Artifacts"->"Bulid" to get a fat jar.
  • Then we submit it in one of two ways:

    1. "spark-submit --class main.scala.Test --master yarn /path.jar"

    2. run the Scala code directly in IDEA in local mode (if we change the master to YARN, it throws exceptions).

  • When the program finishes, we get the processed data and store it.

  • Then we read the processed data's path and pass it to the front end (a minimal controller sketch follows this list).
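
For completeness, the "pass it to the front end" step is just an ordinary Play action. A minimal sketch, assuming Play 2.6+ with injected ControllerComponents; the route, JSON field names and result path are hypothetical:

```scala
import javax.inject.Inject
import play.api.libs.json.Json
import play.api.mvc.{AbstractController, ControllerComponents}

// Hypothetical controller; assumes Play 2.6+ style dependency injection.
class QueryController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

  // Mapped in conf/routes to something like:  POST /query  controllers.QueryController.submit
  def submit = Action(parse.json) { request =>
    // e.g. a JSON body such as {"numClusters": 5}
    val numClusters = (request.body \ "numClusters").asOpt[Int].getOrElse(2)

    // Here the back end would hand numClusters to the Spark job (see the answers below)
    // and remember where the processed data will be written.
    val resultPath = s"/data/processed/kmeans_$numClusters"   // illustrative only

    Ok(Json.obj("status" -> "submitted", "resultPath" -> resultPath))
  }
}
```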

Nothing here is submitted interactively by the user. It is very clumsy!

So, as a user, I want to conveniently query or process data on the cluster and get the feedback on the front end.
What should I do?
Which tools or libraries could I use?

Thanks!

Chuang
  • Do you have some set of pre-defined queries, or can the user create their own queries? Or maybe the user can configure something? – Vladislav Varslavans Mar 08 '18 at 10:31
  • Yes, these queries are pre-defined. The user only needs to input some parameters and press the submit button. For example, we use clustering algorithms to find clusters in a spatial data set (like K-means). The user only needs to input the number of clusters; everything else is handled by the back end on the cluster. Of course, it would be better if users could define queries or configure Spark themselves. – Chuang Mar 08 '18 at 10:42

2 Answers


So generally you have two approaches:

  • Create a Spark application that is also a web service
  • Create a Spark application that is called by a web service

The first approach - the Spark app is itself a web service - is not a good approach, because for as long as your web service is running you will also be using resources on the cluster (except if you run Spark on Mesos with a specific configuration) - read more about cluster managers in the Spark documentation.

The second approach - the service and the Spark app kept separate - is better. In this approach you can create one or more Spark applications that are launched by calling spark-submit from the web service. There are again two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The result of a query in this approach can simply be saved to a file, sent to the web server over the network, or passed using any other inter-process communication approach.
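
A minimal sketch of this second approach, assuming the org.apache.spark:spark-launcher artifact is on the web service's classpath and SPARK_HOME is set on the machine running it; the jar path, main class and argument are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitJob {
  def run(numClusters: Int): Unit = {
    // Builds essentially the same command as a manual spark-submit, but from inside the web service.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/spark-job.jar")   // the fat jar containing the Spark code
      .setMainClass("main.scala.Test")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addAppArgs(numClusters.toString)           // the user's parameters become program args
      .startApplication()

    // The returned SparkAppHandle can be polled (or given a listener) so the web
    // service knows when the job has finished and where the result was written.
    println(handle.getState)
  }
}
```

This way the web service never has to shell out by hand, and the handle tells it when to read the result path and return it to the front end.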

Vladislav Varslavans
  • Oh, thanks. I think you mean running the Spark packaging and submission steps from the Java/Scala code. Am I understanding correctly? – Chuang Mar 08 '18 at 12:51
  • Scala runs on the JVM, same as Java. That means you just create a Scala web server app plus Scala+Spark apps. Each will be a jar file. Then you just launch the jars (`java -jar` for the web server, and `spark-submit` from the web server to submit Spark jobs). – Vladislav Varslavans Mar 08 '18 at 13:13

There are multiple ways to submit a Spark job:

  1. Using the spark-submit command in a terminal.
  2. Using Spark's built-in REST submission API; see the Spark documentation for how to use it (a sketch follows this list).
  3. Providing a REST API yourself inside your program, and setting that API as the Main-Class of the jar you run on your Spark cluster's master. Your API should dispatch each incoming job-submission request to the action you want, and it should instantiate the class where your SparkContext is created; this is the equivalent of what spark-submit does. When the API receives a job-submission request and does the above, you can watch the job's progress in the master web UI, and once the job terminates your API is still up and waits for the next request.
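
Regarding option 2: note that the built-in REST submission server belongs to the standalone master (it does not cover YARN) and may need to be enabled via spark.master.rest.enabled. A hedged sketch of what a submission request might look like; the host, port 6066, jar location and Spark version string are placeholders:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object RestSubmitSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical host, jar path and version; adapt these to the actual cluster.
    val payload =
      """{
        |  "action": "CreateSubmissionRequest",
        |  "appResource": "hdfs:///apps/spark/spark-job.jar",
        |  "mainClass": "main.scala.Test",
        |  "appArgs": ["5"],
        |  "clientSparkVersion": "2.2.0",
        |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
        |  "sparkProperties": {
        |    "spark.app.name": "Test",
        |    "spark.master": "spark://spark-master:7077",
        |    "spark.submit.deployMode": "cluster",
        |    "spark.jars": "hdfs:///apps/spark/spark-job.jar"
        |  }
        |}""".stripMargin

    val conn = new URL("http://spark-master:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))

    // The response contains a submissionId that can later be used to poll the job's status.
    println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```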

The 3rd solution is based on my own experience running different types of algorithms in a web crawler; a minimal sketch of this pattern is shown below.
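
A minimal sketch of that third pattern, using only the JDK's built-in HttpServer so the example stays self-contained; the port, endpoint and the Spark computation itself are placeholders, and the caveat from the other answer applies (the driver holds cluster resources for as long as it runs):

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DriverWithApi {
  def main(args: Array[String]): Unit = {
    // One long-lived SparkSession, reused for every incoming request.
    val spark = SparkSession.builder().appName("DriverWithApi").getOrCreate()

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/run", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // e.g. GET /run?k=5 -- parse the user's parameter from the query string.
        val k = Option(exchange.getRequestURI.getQuery)
          .flatMap(_.split("=").lift(1)).map(_.toInt).getOrElse(2)

        // Placeholder Spark work for this request; the real job would read from HDFS,
        // clean/filter the data and run the clustering algorithm.
        val count = spark.range(0, 1000000).filter(col("id") % k === 0).count()

        val body = s"""{"k":$k,"count":$count}""".getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()   // the driver stays alive and waits for the next request
  }
}
```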

  • It sounds like this approach could only be used for querying. It's an API, but I still need to process data, not just query a database. (If I have something wrong, please forgive me; I don't know much about how APIs work.) – Chuang Mar 08 '18 at 12:54
  • As a matter of fact, all three solutions above are about submitting a job, regardless of the job's type (Spark Core, Spark SQL, ...). You can submit any kind of job, and I did not mention anything about querying databases. So you can use any of the three solutions to submit a job to your Spark cluster. – Amin Heydari Alashti Mar 08 '18 at 22:08