
I have a scenario where I need to process incoming user requests in a Spark job on a 20-node cluster. The Spark application uses deep learning to make predictions on user data stored on HDFS. The idea is to provide an environment like a REST web service, to which users can send requests, and the requests should be processed by Spark in distributed mode on YARN. Here are the issues:

  • When I build the jar file with dependencies, its size is more than 1 GB. The deep CNN models are not embedded in the jar file.
  • Running the application via spark-submit for every incoming request seems impractical because:
    1. spark-submit has its own overhead: resource allocation, assignment of JVM application containers, etc. take time
    2. The application loads the trained deep CNN models on startup; a single model is ~700 MB and also takes time to load

My idea is to submit the application once via spark-submit as an indefinitely running job, keep the SparkContext and the models in memory, and expose a REST endpoint that users can send requests to. Upon receiving a request, trigger a map operation from within the running job, get the result, and return it to the user in JSON format. This way, requests would be processed immediately, without the startup delay. Is this possible?
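To make this concrete, here is a minimal, untested sketch of what I have in mind, with the REST endpoint embedded in the driver via the Spark Java micro-framework (the Model class, the HDFS paths, and the route are placeholders I made up):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import java.io.Serializable;
import java.util.List;

import static spark.Spark.get;
import static spark.Spark.port;

public class PredictionServer {

    // Stand-in for the real ~700 MB CNN model; it must be Serializable so
    // it can be broadcast to the executors once at startup.
    public static class Model implements Serializable {
        public double predict(String record) {
            return record.length(); // dummy scoring logic
        }
    }

    public static void main(String[] args) {
        // One long-lived SparkContext for the lifetime of the service.
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("prediction-service"));

        // Load the model once and broadcast it, so every executor keeps a
        // single in-memory copy instead of reloading ~700 MB per request.
        Broadcast<Model> model = jsc.broadcast(new Model());

        // Embedded HTTP endpoint (Spark Java micro-framework, unrelated to
        // Apache Spark) running inside the driver JVM.
        port(8080);
        get("/predict/:userId", (req, res) -> {
            String path = "hdfs:///user-data/" + req.params(":userId");
            // Each HTTP request triggers an ordinary Spark job on the
            // shared context; concurrent requests become concurrent jobs.
            List<Double> scores = jsc.textFile(path)
                    .map(line -> model.value().predict(line))
                    .collect();
            res.type("application/json");
            return scores.toString(); // crude JSON; a real app would use a JSON library
        });
    }
}
```

My assumption here is that concurrent requests would simply become concurrent jobs on the shared SparkContext, which as far as I know the Spark scheduler supports.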

I have studied many articles, as well as Stack Overflow questions such as Using Spark to process requests, Best Practice to launch Spark Applications via Web Application?, run spark as java web application, how to deploy war file in spark-submit command (spark), and Creating a standalone on-demand Apache Spark web service; however, none of them fits the scenario I described.

From the articles and Stack Overflow questions, I learned that the Spark REST API as well as Apache Livy can be used to submit Spark jobs. However, in both cases a Spark job is submitted for every request, which suffers from the same problems described above (the 1+ GB jar plus loading the models on startup). Also, what happens with multiple concurrent incoming requests? Am I right?
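To illustrate my understanding: with Livy, each user request would have to turn into a fresh batch submission, roughly like the following (the Livy host, jar path, and class name are placeholders; the POST /batches endpoint is what I understand from the Livy REST docs):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivyBatchSubmit {
    public static void main(String[] args) throws Exception {
        // One POST per user request => one brand-new Spark application per
        // request, which re-ships the 1+ GB jar and reloads the models.
        String body = "{ \"file\": \"hdfs:///apps/prediction-assembly.jar\","
                    + "  \"className\": \"com.example.PredictionJob\" }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://livy-host:8998/batches")) // placeholder host
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // returns the batch id and state
    }
}
```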

I read that Uber uses Spark for route calculation (article, article, article), but it is closed source and I have no idea how they do it on the fly for every incoming user request.

In a nutshell: is it possible to embed a REST microservice within the Spark job using a lightweight framework such as Spark Java, along the lines of the sketch above? Spark Streaming is also not applicable in this scenario because there is no streaming data source.

I have searched for this for a long time and never found a practical solution. If my understanding of the Spark REST API and Livy is wrong, please correct me. And if my idea is wrong, can you guide me to another approach that would get the job done? Any help or suggestions will be highly appreciated.

Raja Ayaz
  • According to Livy's website, "_features include: Have **long running Spark Contexts** that can be used for multiple Spark jobs, by multiple clients; Share **cached RDDs or Dataframes across multiple jobs** and clients..._" – mazaneicha Jun 05 '20 at 22:18

0 Answers