
I was trying to convert a JDBC ResultSet into a Spark RDD and was looking for an efficient way to do it using Spark's parallelism.

Below is what I have implemented, following https://stackoverflow.com/a/32073423/6064131:

import java.sql.ResultSet

val rs: ResultSet = stmt.getResultSet
val colCount = rs.getMetaData.getColumnCount
val delim = "|" // assumed column delimiter; not defined in the original snippet

// Concatenate every column of the current row into one delimited string
def getRowFromResultSet(resultSet: ResultSet): String = {
  var i = 1
  var rowStr = ""
  while (i <= colCount) {
    rowStr += resultSet.getString(i) + delim
    i += 1
  }
  rowStr
}

// Drain the ResultSet into a local List, then hand it to Spark;
// every row still flows through the single JDBC connection first
val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map { r =>
  getRowFromResultSet(r._2) // ResultSet => delimited String
}.toList

val x = sc.parallelize(resultSetList)

Now the main issue is that this takes more time, and I understand why: the whole dataset is pulled through one needle's eye. But is there a better way to achieve this?

Some might be wondering why I am not using the built-in sqlContext.read.format to achieve this. The reason is that Spark wraps a "SELECT * FROM ( )" around the query, which creates issues with complex queries. Please refer to this question for details: Issue with WITH clause with Cloudera JDBC Driver for Impala - Returning column name instead of actual Data.
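
For context, here is roughly what the built-in path looks like and why it breaks. This is only a sketch; the URL and query below are hypothetical placeholders, not my actual code:

// Sketch only: url and query are placeholders.
// Spark builds its SQL as roughly SELECT <columns> FROM <dbtable> <where>,
// so the dbtable value must be a table name or a parenthesized subquery.
// Nesting a WITH-clause query inside that wrapper is what trips up the driver.
val query = "WITH t AS (SELECT id, name FROM users) SELECT * FROM t"
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:impala://impala-host:21050/default") // placeholder URL
  .option("dbtable", s"($query) AS q")
  .load()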

Arghya Saha

1 Answer


But is there any better way to achieve this?

I wouldn't reinvent the wheel. If you still see the same issue with a recent Spark version (1.6 is pretty old) and a recent JDBC driver (my guess is the driver is to blame), just CREATE VIEW over the complex query and run your queries against the view.
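
A rough sketch of that workaround, assuming a hypothetical view name, connection URL, and query (none of these come from the original post):

// Sketch only: view name, URL, and query are hypothetical placeholders.
// Create the view once over plain JDBC, then read it back through the
// built-in source; a bare view name wraps cleanly in SELECT * FROM ...
import java.sql.DriverManager

val url = "jdbc:impala://impala-host:21050/default"
val conn = DriverManager.getConnection(url)
try {
  conn.createStatement().execute(
    "CREATE VIEW complex_query_view AS " +
    "WITH t AS (SELECT id, name FROM users) SELECT * FROM t")
} finally {
  conn.close()
}

val df = sqlContext.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "complex_query_view")
  .load()

The built-in source's partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) then give you the parallel read that the hand-rolled ResultSet loop never had.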

Also don't forget to file a bug report.

Alper t. Turker