Our requirement is to run some analytical operations on Phoenix (HBase) time-series tables. We have a table in PostgreSQL that holds unique IDs.
Right now we fetch all unique IDs from the PostgreSQL table, query the Phoenix tables for the data belonging to each ID, and apply the analytical functions. However, the IDs are processed sequentially, one after another, and we need this to run in parallel. We are using Scala and Spark to implement this functionality.
Below is the sample code:
val optionsMap = Map(
  "driver"   -> config.jdbcDriver,
  "url"      -> config.jdbcUrl,
  "user"     -> config.jdbcUser,
  "password" -> config.jdbcPassword,
  "dbtable"  -> query)

val uniqDF = sqlContext.read.format("jdbc").options(optionsMap).load()
val results = uniqDF.collect()

// collect() pulls every row to the driver; the loop below then
// processes one id at a time, sequentially
results.foreach { row =>
  val data = loadHbaseData(row.getString(0))
  data.map(func).save()
}

def loadHbaseData(id: String): DataFrame = {
  sqlContext.phoenixTableAsDataFrame(
    "TIMESERIETABLE",
    Array("TIMESTAMP", "XXXX"),
    predicate = Some("\"ID\" = '" + id + "' "),
    conf = configuration)
}
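For reference, the direction I have been wondering about (but am not sure is right) is to avoid collecting the IDs to the driver at all and instead let Spark distribute the work itself, e.g. by loading the Phoenix table as a single DataFrame and joining it with the ID DataFrame. A rough sketch only, reusing the placeholder table/column names and `func` from the code above, and assuming the Phoenix table also exposes the `ID` column:

```scala
import org.apache.phoenix.spark._

// Load the whole time-series table once instead of issuing one
// predicate query per id (sketch; column names are placeholders).
val tsDF = sqlContext.phoenixTableAsDataFrame(
  "TIMESERIETABLE",
  Array("ID", "TIMESTAMP", "XXXX"),
  conf = configuration)

// Join against the ids loaded from PostgreSQL; Spark partitions the
// join and the subsequent transformation across the cluster, so the
// per-id work runs in parallel rather than in a driver-side loop.
val joined      = tsDF.join(uniqDF, tsDF("ID") === uniqDF("ID"))
val transformed = joined.map(func)
```

I am not sure whether this join-based approach, or something like wrapping the existing loop in a parallel collection, is the idiomatic way to do it in Spark.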
Could you please let me know what the better approach would be?