I am new to SPARK and figuring out a better way to achieve the following scenario. There is a database table containing 3 fields - Category, Amount, Quantity. First I try to pull all the distinct Categories from the database.
val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
Now for each category I want to execute the Pipeline which essentially creates dataframes from each category and apply some Machine Learning.
categories.foreach(executePipeline)
def execute(category: String): Unit = {
val dfCategory = sqlCtxt.read.jdbc(JDBC_URL,"SELECT * FROM TABLE_NAME WHERE CATEGORY="+category)
dfCategory.show()
}
Is it possible to do something like this ? Or is there any better alternative ?