
I have a JSONB column called data in a Postgres DB. I am hoping to query and process this column with Spark. What I have figured out so far:

1. Set up a DataFrame via JDBC

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar",
  "dbtable" -> "mydb",
  "driver" -> "org.postgresql.Driver"))
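
(Aside: sqlContext.load(source, options) is deprecated as of Spark 1.4; the same read can be expressed with the DataFrameReader builder. A minimal equivalent sketch, reusing the same placeholder connection options:)

// Same JDBC read expressed with sqlContext.read (Spark 1.4+)
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar")
  .option("dbtable", "mydb")
  .option("driver", "org.postgresql.Driver")
  .load()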

2. Select the data column as an RDD[Row] and map it to an RDD[String]

// rdd is a field, not a method, and Row.toString would wrap the value in extra brackets;
// getString(0) extracts just the JSON text from the single selected column
val myRdd = df.select("data").rdd.map(_.getString(0))

3. Use Spark SQL to read the RDD[String] as a JSON DataFrame

val jsonRdd = sqlContext.read.json(myRdd)

Is there a more straightforward way? It seems like a big detour to cast the JSON into a string and back into JSON. Also, step 3 is extremely slow; maybe read.json() is not lazy?
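
If the structure of the JSON documents is known up front, supplying an explicit schema to read.json should at least avoid the inference scan over the data. A minimal sketch, with hypothetical field names that would have to match the real documents:

import org.apache.spark.sql.types._

// Hypothetical schema for the documents stored in the jsonb column
val jsonSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

// With a schema supplied, read.json does not need to scan the data to infer one
val jsonDf = sqlContext.read.schema(jsonSchema).json(myRdd)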

CJLam
  • _maybe read.json() is not lazy?_ - yes, in this case it is not exactly lazy. Since you don't provide a schema, it has to perform a full data scan to infer one, and to do that it has to process all the data. You can parse the JSON directly (see the second solution in the linked answer), but keep in mind that arbitrary JSON cannot be represented in Spark SQL. – zero323 Dec 25 '15 at 07:40
  • Thank you. I was really asking about Spark's ability to query PostgreSQL's JSONB data; I will rephrase and ask another question. – CJLam Dec 28 '15 at 02:52
  • The only way is to pass the query statically in the table definition; see http://stackoverflow.com/a/32585936/1560062 and the sketch below. Otherwise you have to parse the JSON string as shown in the question linked as a duplicate. You could probably modify the data source code to add schema inference on top of json/jsonb and convert the data to structs, but in the worst case it would require a full data scan. – zero323 Dec 28 '15 at 07:57
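
A minimal sketch of the static-query approach described in the comment above, assuming a hypothetical table mytable with an id column; the jsonb value is cast to text inside the subquery so it reaches Spark as a plain string column:

// Hypothetical subquery passed as the JDBC table definition (Postgres requires the alias).
// data::text exposes the jsonb column to Spark as an ordinary string.
val withCast = "(SELECT id, data::text AS data FROM mytable) AS tmp"

val dfCast = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar")
  .option("dbtable", withCast)
  .option("driver", "org.postgresql.Driver")
  .load()

// dfCast("data") is a string column that can then be parsed, e.g. with sqlContext.read.json as in step 3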

0 Answers