
I got this far:

import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature

case class Person(name: String, lovesPandas: Boolean)

val mapper = new ObjectMapper()

val input = sc.textFile("files/pandainfo.json")
val result = input.flatMap(record => {
    try{
        Some(mapper.readValue(record, classOf[Person]))
    } catch {
        case e: Exception => None
    }
})
result.collect

but I get `Array()` as a result (with no error). The file is https://github.com/databricks/learning-spark/blob/master/files/pandainfo.json. How do I go on from here?


After consulting "Spark: broadcasting jackson ObjectMapper" I tried

import org.apache.spark._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature

case class Person(name: String, lovesPandas: Boolean)

val input = """{"name":"Sparky The Bear", "lovesPandas":true}"""
val result = input.flatMap(record => {
    try{
        val mapper = new ObjectMapper()
        mapper.registerModule(DefaultScalaModule)
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
        Some(mapper.readValue(record, classOf[Person]))
    } catch {
        case e: Exception => None
    }
})
result.collect

and got

Name: Compile Error
Message: <console>:34: error: overloaded method value readValue with alternatives:
  [T](x$1: Array[Byte], x$2: com.fasterxml.jackson.databind.JavaType)T <and>
  [T](x$1: Array[Byte], x$2: com.fasterxml.jackson.core.type.TypeReference[_])T <and>
  [T](x$1: Array[Byte], x$2: Class[T])T <and>
Make42
  • I've only been googling, but do you need `mapper.registerModule(DefaultScalaModule)`? Also, have you tried to parse a Person from a literal String outside of Spark just to check that bit's working OK? – The Archetypal Paul May 19 '16 at 13:23
  • @TheArchetypalPaul This can be resolved with a little Google help and some debugging. – Yuval Itzchakov May 19 '16 at 13:44
  • @YuvalItzchakov, well, yes, but I'm not sure why you addressed that comment to me! – The Archetypal Paul May 19 '16 at 13:47
  • @TheArchetypalPaul: 1) If I add this, I get `Name: org.apache.spark.SparkException Message: Task not serializable` 2) How do I add the literal string? `val text = new String('{"name":"Sparky The Bear", "lovesPandas":true}')` gives `Message: <console>:1: error: unclosed character literal` – Make42 May 19 '16 at 13:48
  • I did my googling beforehand. (Though regarding 2: forget it ;-).) – Make42 May 19 '16 at 13:52
  • To include quotes in a quoted string, one reads any basic introduction/tutorial to Scala. Really, from this and other recent questions, you're making your task more difficult by not taking a short time out to look up some Scala basics. And start with the simple stuff - get the JSON reading working, then add the Spark part. – The Archetypal Paul May 19 '16 at 14:02
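
Following up on the comments, a minimal sketch of testing the Jackson part outside Spark; a triple-quoted Scala string avoids the unclosed-literal problem with the embedded quotes, and `ScalaObjectMapper` is the mixin from the imports already shown above:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

case class Person(name: String, lovesPandas: Boolean)

// triple quotes: the inner double quotes need no escaping
val json = """{"name":"Sparky The Bear", "lovesPandas":true}"""

val mapper = new ObjectMapper with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)   // without this, Scala case classes fail to deserialize

println(mapper.readValue(json, classOf[Person]))   // Person(Sparky The Bear,true)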

2 Answers


I see that you tried the Learning Spark examples. Here is the reference to the complete code: https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/BasicParseJsonWithJackson.scala
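
The key point in that example is that the ObjectMapper is created inside mapPartitions, once per partition: ObjectMapper is not serializable, so it has to be built on the executors rather than on the driver (that is the `Task not serializable` error from the comments). A sketch of that pattern, adapted rather than copied verbatim:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

case class Person(name: String, lovesPandas: Boolean)

val input = sc.textFile("files/pandainfo.json")
val result = input.mapPartitions(records => {
  // one mapper per partition, created on the executor
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)   // required for Scala case classes
  records.flatMap(record => {
    try {
      Some(mapper.readValue(record, classOf[Person]))
    } catch {
      case e: Exception => None   // skip lines that do not parse as a Person
    }
  })
})
result.collect()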

Erica
  • This does not work in the Jupyter notebook. Jupyter does not know how to import the Jackson classes. Any idea? – Make42 May 19 '16 at 18:20
  • Can you explain better? – Erica May 20 '16 at 09:42
  • The code you pointed me to does not work in the Jupyter notebook. I am starting to suspect that this is not because the code is wrong, but because of the notebook. Maybe the code from Learning Spark is not working well together with the notebook. – Make42 May 20 '16 at 10:40
  • Maybe you need to change the metadata of the notebook, adding the Jackson package. – Erica May 20 '16 at 12:24

Instead of `sc.textFile("path/to/json")` you can try this (I write it in Java because I don't know Scala, but the API is the same):

SQLContext sqlContext = new SQLContext(sc);
// Spark infers the schema and builds a DataFrame from the JSON records
DataFrame dfFromJson = sqlContext.read().json("path/to/json/file.json");

Spark will read your JSON file and convert it into a DataFrame.
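
For the Scala shell used in the question, the equivalent is (a sketch, assuming Spark 1.x, where `read.json` returns a `DataFrame`):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// schema is inferred; lines that fail to parse end up in a _corrupt_record column
val dfFromJson = sqlContext.read.json("files/pandainfo.json")
dfFromJson.show()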

If your JSON file is nested, you could use

org.apache.spark.sql.functions.explode(e: Column): Column

For example, see my answer here.
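
As an illustration, assuming a record with a nested array such as {"name": ..., "knows": {"friends": [...]}} (this structure is made up for the sketch, not taken from the asker's file):

import org.apache.spark.sql.functions.explode

// one output row per element of the nested friends array
val friends = dfFromJson.select(dfFromJson("name"),
  explode(dfFromJson("knows.friends")).as("friend"))
friends.show()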

Hope this helps you.

Yuan JI