I have a JSONB column called data in a Postgres DB. I am hoping to query and process this column with Spark. What I have figured out so far:
1. Set up a DataFrame via JDBC
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar",
  "dbtable" -> "mydb",
  "driver" -> "org.postgresql.Driver"))
2. Select the data column and map the resulting RDD[Row] to RDD[String]
// .rdd (no parentheses); getString(0) extracts the raw JSON string,
// whereas Row.toString wraps the value in brackets and breaks parsing
val myRdd = df.select("data").rdd.map(row => row.getString(0))
3. Use Spark SQL to parse the RDD[String] as JSON
val jsonRdd = sqlContext.read.json(myRdd)
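Related to step 3: my understanding is that read.json() has to scan the whole RDD once just to infer a schema, which might explain the slowness. A sketch of what I plan to try, passing an explicit schema so that inference pass is skipped (the field names name and age are hypothetical placeholders for my actual JSON keys):

import org.apache.spark.sql.types._

// Hypothetical schema; replace with the actual shape of the JSON documents
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true)))

// With an explicit schema, read.json should not need the inference scan
val jsonDf = sqlContext.read.schema(schema).json(myRdd)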
Is there a more straightforward way? Converting the JSON to a string and back to JSON feels like a big detour. Also, step 3 is extremely slow; maybe read.json() is not lazy?
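For context, the only alternative I have come up with so far is to skip the detour entirely and extract fields inside Postgres, by passing a subquery as dbtable (a sketch; mytable, id, and the ->> keys are hypothetical):

// Postgres does the JSON extraction; Spark only sees plain text columns
val query = "(SELECT id, data->>'name' AS name, data->>'age' AS age FROM mytable) AS tmp"
val extracted = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar",
  "dbtable" -> query,
  "driver" -> "org.postgresql.Driver"))

The downside is that I have to know the keys up front, which is why I'd still prefer a direct JSONB-to-DataFrame route if one exists.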