I'm trying to figure out the best approach to call a Rest endpoint from Spark.
My current approach (solution [1]) looks something like this -
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)
lazy val restEndPoint = new restEndPointCaller() // lazy evaluation of the object which creates the connection to REST. lazy vals are also initialized once per JVM (executor)
val enrichedDf = repartitionedDf
.map(rec => restEndPoint.getResponse(rec)) // calls the rest endpoint for every record
.toDF
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it looks like spark optimizes the repartition -> map to a mapPartition anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
val restEndPoint = new restEndPointCaller /*creates a db connection per partition*/
val newPartition = partition.map(record => {
restEndPoint.getResponse(record, connection)
}).toList // consumes the iterator, thus calls readMatchingFromDB
restEndPoint.close() // close dbconnection here
newPartition.iterator // create a new iterator
})
In this third approach (solution [3]), a connection is created once per JVM (executor) reused across all partitions processed by the executor.
lazy val connection = new DbConnection /*creates a db connection per partition*/
val newDs = myDs.mapPartitions(partition => {
val newPartition = partition.map(record => {
readMatchingFromDB(record, connection)
}).toList // consumes the iterator, thus calls readMatchingFromDB
newPartition.iterator // create a new iterator
})
connection.close() // close dbconnection here
[a] With Solutions [1] and [3] which are very similar, is my understanding of how lazy val work correct? The intention is to restrict the number of connections to 1 per executor/ JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the rest endpoint ?
[c] Please let me know if there are better and more efficient ways to do this.
Thanks!