Let me first inform all of you that I am very new to Spark.
I need to process a huge number of records in a table and when it is grouped by email it is around 1 million. I need to perform multiple logical calculations based on the data set against individual email and update the database based on the logical calculation
Roughly my code structure is like
//Initial Data Load ...
import sparkSession.implicits._
var tableData = sparkSession.read.jdbc(<JDBC_URL>, <TABLE NAME>, connectionProperties).select("email").where(<CUSTOM CONDITION>)
//Data Frame with Records with grouping on email count greater than one
var recordsGroupedBy =tableData.groupBy("email").count().withColumnRenamed("count", "recordcount").filter("recordcount > 1 ").toDF()
//Now comes the processing after grouping against email using processDataAgainstEmail() method
recordsGroupedBy.collect().foreach(x=>processDataAgainstEmail(x.getAs("email"),sparkSession))
Here I see foreach is not parallelly executed. I need to invoke the method processDataAgainstEmail(,) in parallel. But if I try to parallelize by doing
Hi I can get a list by invoking
val emailList =dataFrameWithGroupedByMultipleRecords.select("email").rdd.map(r => r(0).asInstanceOf[String]).collect().toList
var rdd = sc.parallelize(emailList )
rdd.foreach(x => processDataAgainstEmail(x.getAs("email"),sparkSession))
This is not supported as I can not pass sparkSession when using parallelize.
Can anybody help me with this as in processDataAgainstEmail(,) multiple operations would be performed related to database insert and update and also spark dataframe and spark SQL operations needs to be performed?
To summerize I need to invoke parallelly processDataAgainstEmail(,) with sparksession
In case it is not all possible to pass spark sessions, the method won't be able to perform anything on the database. I am not sure what would be the alternate way as parallelism on email is a must for my scenario.