Write ResultSet to HDFS using SparkContext (one connection, multiple queries)

Question

In Scala, I need to use the same connection to run several queries and write the output to HDFS using the Spark context. It has to be the same connection because some of the queries create volatile tables; if the connection is closed, the volatile tables are gone.

I am aware of the following function:

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
  "dbtable" -> "schema.tablename")).load()

But this requires creating a new connection each time I run a query. Is there an alternative? I can get a ResultSet from a Connection object, but how do I hand that ResultSet to sqlContext so the data is written to HDFS?

Can't you insert all the volatile table data into a staging table with one query/connection, and then run another query to select from it separately? - Ram Ghadiyaram, Nov 08 '16 at 06:13
I cannot create all the volatile tables in one query: there are 4 different tables, and each depends on one or more of the previously created volatile tables. The only way I have found to run more than one query on one connection is the normal executeQuery function, but that makes little use of Spark and ends up crashing because of the large amount of returned data. I also came across the JdbcRDD method from this link http://www.infoobjects.com/spark-sql-jdbcrdd/ and ran into an "Object not serializable" exception that I could not resolve. - Sarah, Nov 09 '16 at 23:04
So that means sharing a connection is not possible, as I mentioned. - Ram Ghadiyaram, Nov 10 '16 at 03:07

score 0 · Answer 1 · edited May 23 '17 at 12:06

AFAIK you can't share the same connection across multiple workers. Each partition may be processed on a different machine, so the partitions cannot share one connection, either in the old approach (JdbcRDD) or in the new approach using DataFrames.
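One way to work within that limit, along the lines of the staging-table suggestion in the comments, is to do all the volatile-table work on a single driver-side connection, copy the final result into a permanent staging table, and only then let Spark read that table. A minimal, untested sketch: SingleConnectionStaging, runOnSingleConnection, exportStaged and their parameters are hypothetical names, the SQL statements are supplied by the caller (the last one is assumed to populate the permanent staging table), and Parquet output is an assumption.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext

object SingleConnectionStaging {

  // Runs all statements on ONE driver-side JDBC connection, in order, so a
  // volatile table created by an earlier statement is visible to later ones.
  def runOnSingleConnection(url: String, statements: Seq[String]): Unit = {
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try statements.foreach(sql => stmt.execute(sql))
      finally stmt.close()
    } finally conn.close() // the volatile tables disappear here
  }

  // Assumption: the last statement copies the final result into the
  // permanent table `stagingTable`, which outlives the connection.
  def exportStaged(sqlContext: SQLContext, url: String, statements: Seq[String],
                   stagingTable: String, hdfsPath: String): Unit = {
    runOnSingleConnection(url, statements)
    // Spark opens its own connection(s) here; only the staging table is read.
    sqlContext.read.format("jdbc")
      .options(Map("url" -> url, "dbtable" -> stagingTable))
      .load()
      .write.parquet(hdfsPath)
  }
}
```

Without the partitionColumn, lowerBound, upperBound and numPartitions options, the JDBC source reads the staging table in a single partition, so add them if the read needs to be parallel.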

pls have a look at

However, I found something interesting (not a built-in API option, and I haven't tested it): ShardedJdbcRDD.scala, which you could try.
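Another option that stays on the driver with a single connection is to skip sqlContext for the write and stream the ResultSet row by row into an HDFS file through the Hadoop FileSystem API, so the rows are never all held in memory at once. A rough, untested sketch, assuming tab-separated text output is acceptable and that the JDBC driver honours the fetch-size hint; ResultSetToHdfs, formatRow, dump and their parameters are hypothetical names, and the write is not parallel.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.sql.Connection
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

object ResultSetToHdfs {

  // Joins one row into a tab-separated line; SQL NULL becomes an empty field.
  // Tabs or newlines inside values are NOT escaped in this sketch.
  def formatRow(values: Seq[String]): String =
    values.map(v => if (v == null) "" else v).mkString("\t")

  // Runs `query` on the caller's open connection (so its volatile tables are
  // still there) and streams the rows to `hdfsPath` one at a time.
  def dump(sc: SparkContext, conn: Connection, query: String, hdfsPath: String): Unit = {
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val out = new BufferedWriter(
      new OutputStreamWriter(fs.create(new Path(hdfsPath)), StandardCharsets.UTF_8))
    val stmt = conn.createStatement()
    try {
      stmt.setFetchSize(10000) // a hint only; some drivers ignore it
      val rs = stmt.executeQuery(query)
      val columns = rs.getMetaData.getColumnCount
      while (rs.next()) {
        out.write(formatRow((1 to columns).map(i => rs.getString(i))))
        out.newLine()
      }
    } finally {
      stmt.close() // also closes the ResultSet
      out.close()
    }
  }
}
```

The caller keeps the connection open across all the volatile-table statements, calls dump for the final SELECT, and closes the connection afterwards.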
