
I was trying to convert a JDBC ResultSet into a Spark RDD and was looking for an efficient way to do it using Spark's parallelism.

Below is what I have implemented, following https://stackoverflow.com/a/32073423/6064131:

import java.sql.ResultSet

val rs: ResultSet = stmt.getResultSet
val colCount = rs.getMetaData.getColumnCount
val delim = "|" // assumed column delimiter; not defined in the original snippet

// Concatenate every column of the current row into one delimited string
def getRowFromResultSet(resultSet: ResultSet): String = {
  var i = 1
  var rowStr = ""
  while (i <= colCount) {
    rowStr += resultSet.getString(i) + delim
    i += 1
  }
  rowStr
}

// Drain the ResultSet into a local List, then hand it to Spark;
// every row still flows through the single JDBC connection first
val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map { r =>
  getRowFromResultSet(r._2) // ResultSet => delimited String
}.toList

val x = sc.parallelize(resultSetList)

Now the main issue is that this takes more time, and I understand why: the whole dataset is pulled through one needle's eye. But is there a better way to achieve this?

Some might be wondering why I am not using the built-in sqlContext.read.format to achieve this. The reason is that Spark wraps a "SELECT * FROM ( )" around the query, which creates issues with complex queries. Please refer to this question for details: Issue with WITH clause with Cloudera JDBC Driver for Impala - Returning column name instead of actual Data.
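
For context, here is roughly what the built-in path looks like and why it breaks. This is only a sketch; the URL and query below are hypothetical placeholders, not my actual code:

// Sketch only: url and query are placeholders.
// Spark builds its SQL as roughly SELECT <columns> FROM <dbtable> <where>,
// so the dbtable value must be a table name or a parenthesized subquery.
// Nesting a WITH-clause query inside that wrapper is what trips up the driver.
val query = "WITH t AS (SELECT id, name FROM users) SELECT * FROM t"
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:impala://impala-host:21050/default") // placeholder URL
  .option("dbtable", s"($query) AS q")
  .load()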

Arghya Saha

1 Answer


But is there any better way to achieve this?

I wouldn't reinvent the wheel. If you still see the same issue with a recent Spark version (1.6 is pretty old) and a recent JDBC driver (my guess is the driver is to blame), just CREATE VIEW over the complex query and run your queries against the view.
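
A rough sketch of that workaround, assuming a hypothetical view name, connection URL, and query (none of these come from the original post):

// Sketch only: view name, URL, and query are hypothetical placeholders.
// Create the view once over plain JDBC, then read it back through the
// built-in source; a bare view name wraps cleanly in SELECT * FROM ...
import java.sql.DriverManager

val url = "jdbc:impala://impala-host:21050/default"
val conn = DriverManager.getConnection(url)
try {
  conn.createStatement().execute(
    "CREATE VIEW complex_query_view AS " +
    "WITH t AS (SELECT id, name FROM users) SELECT * FROM t")
} finally {
  conn.close()
}

val df = sqlContext.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "complex_query_view")
  .load()

The built-in source's partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) then give you the parallel read that the hand-rolled ResultSet loop never had.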

Also don't forget to file a bug report.

Alper t. Turker