I am trying to convert a JDBC ResultSet into a Spark RDD and am looking for an efficient way to do that using Spark's parallelism.
Below is what I have implemented, following this answer: https://stackoverflow.com/a/32073423/6064131
import java.sql.ResultSet

val rs: ResultSet = stmt.getResultSet
val colCount = rs.getMetaData.getColumnCount

// Build one delimited String from the current row
// (delim is the column separator, defined elsewhere)
def getRowFromResultSet(resultSet: ResultSet): String = {
  var i = 1
  var rowStr = ""
  while (i <= colCount) {
    rowStr = rowStr + resultSet.getString(i) + delim
    i += 1
  }
  rowStr
}

// Drain the entire ResultSet on the driver, one String per row
val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map { r =>
  getRowFromResultSet(r._2) // ResultSet => delimited String (not a spark.sql.Row)
}.toList

val x = sc.parallelize(resultSetList)
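For completeness, this is how I sanity-check the result on the driver:

x.take(5).foreach(println) // inspect a few delimited rows
println(x.count())         // total number of rows pulled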
Now the main issue is that this takes a long time, and I understand why: the entire dataset is pulled through a single connection on the driver before it is parallelized. But is there a better way to achieve this?
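For what it's worth, the closest built-in alternative I have found is Spark's JdbcRDD, which opens one connection per partition and, as far as I can tell, runs the SQL exactly as given. Below is a minimal sketch assuming the query can be bounded by a numeric column; the id column, its bounds, and the connection details are placeholders, and my real queries may not have such a column:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val jdbcRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:impala://host:21050", "user", "pass"),
  "SELECT * FROM my_table WHERE id >= ? AND id <= ?", // the '?'s become per-partition bounds
  1L,       // lowerBound of id (assumed)
  1000000L, // upperBound of id (assumed)
  10,       // numPartitions => 10 parallel connections
  (r: ResultSet) => (1 to r.getMetaData.getColumnCount).map(i => r.getString(i)).mkString(delim)
)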
Some might be wondering why I am not using the built-in sqlContext.read.format for this. The reason is that Spark wraps a "SELECT * FROM ( )" around the query, which causes problems with complex queries. Please refer to this question for details: Issue with WITH clause with Cloudera JDBC Driver for Impala - Returning column name instead of actual Data
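For reference, this is roughly the standard read path I am avoiding; Spark treats the dbtable value as a table name or subquery, so the query ends up inside SELECT * FROM ( ... ), which is what breaks WITH-clause queries against the Impala driver (the connection details here are placeholders):

// Standard JDBC read: Spark effectively runs SELECT * FROM (<dbtable>) <alias>,
// which is where the wrapping problem with WITH-clause queries comes from
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:impala://host:21050")
  .option("driver", "com.cloudera.impala.jdbc41.Driver")
  .option("dbtable", "(WITH t AS (SELECT ...) SELECT * FROM t) q")
  .load()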