5

I have the following string in one of my dataframe's column:

row1:[{"key":"foo"},{"key":"bar"},{"key":"baz"}]
row2:[{"key":"foo"},{"key":"bar"}]
row3:null
etc

I found that Spark has "get_json_object" function. So if I want to use xpath to extract data I would use:

 get_json_object($"json", s"$[0].key")

would returns:

"foo"
"foo"
null

but I need the equivalent of "explosion" function of Spark.

I found that I can use the "*" symbol on my xpath.

 get_json_object($"json", s"$[*].key")

Which don't do the as expected, it will create a string like:

[foo,bar,baz]
[foo,baz]

I found the solution in another stackoverflow thread,

val jsonElements = (0 until 3).map(i => get_json_object($"json", s"$$[$i].key"))


val jsonElements = .map(i => get_json_object($"json", s"$$[$i].key"))
df.select($"id",explode(array(jsonElements: _*).alias("foo")))

This partially solve my issue, because this solution presumes that I know how maximum depth my array can be. The function "from_json" of Spark needs schema, I have huge complex JSON type would take an "infinity" amount of time to create the schema.

Disclaimer

I will not use any regular expression/substring/etc to parse JSON. The entire propose of use a parser is that.

zero323
  • 322,348
  • 103
  • 959
  • 935
Mantovani
  • 500
  • 2
  • 7
  • 18
  • `get_json_object` returns a string, not a parsed object, so that's not an option. Take a look at [`from_json`](https://stackoverflow.com/q/34069282/6910411) – zero323 Oct 30 '18 at 21:47

1 Answers1

-2

Just sticking to the scala basics can solve it simple. Try case classes with options to solve the problem.

You can use any standard json parser. I use liftweb.

import net.liftweb.json.{DefaultFormats, parseOpt}

case class jsonElement(key: String, value: Optional[String])
//assuming the value key always exists and value may or may not exist, 
//so making that as optional / ignore the fields if you don't really care at all

val jsonKeys = inputRdd.map(eachRow => 
  implicit val formats = DefaultFormats // hate this but deal with scala

  val parsedObject = parseOpt(eachRow).flatMap(_.extractOpt[List[jsonElement]])

  parsedObject match{
    case Some(parsedItem) => parsedItem.map(json => json.key)
    case None => List()
})

This gives a Rdd of list(key). Use filter(list => !list.isEmpty) if you want to remove the empty lists. You know it from there.

Munna
  • 336
  • 1
  • 12