2

I am using spark 1.6.3 to parse a json strucuture

I have a json structure below :

    {
    "events":[
      {
        "_update_date":1500301647576,
        "eventKey":"depth2Name",
        "depth2Name":"XYZ"
      },
      {
        "_update_date":1500301647577,
        "eventKey":"journey_start",
        "journey_start":"2017-07-17T14:27:27.144Z"
      }]
    }

i want parse the above JSON to 3 columns in dataframe. eventKey's value(deapth2Name) is a node in Json(deapth2Name) and i want to read the value from corresponding node add it to a column "eventValue" so that i can accommodate any new events dynamically.

Here is the expected output:

    _update_date,eventKey,eventValue
    1500301647576,depth2Name,XYZ
    1500301647577,journey_start,2017-07-17T14:27:27.144Z

sample code:

    val x = sc.wholeTextFiles("/user/jx665240/events.json").map(x => x._2)
    val namesJson = sqlContext.read.json(x)

    namesJson.printSchema()
    namesJson.registerTempTable("namesJson")
    val eventJson=namesJson.select("events")
    val mentions1  =eventJson.select(explode($"events")).toDF("events").select($"events._update_date",$"events.eventKey",$"events.$"events.eventKey"")

$"events.$"events.eventKey"" is not working.

Can you please suggest how to fix this issue.

Thanks, Sree

philantrovert
  • 9,904
  • 3
  • 37
  • 61
sree
  • 21
  • 2
  • Your example assumes the order of the columns is relevant and does not do any dynamic adjustment based on column values. Is that intentional? I have answered a similar question here: https://stackoverflow.com/questions/43347098/pyspark-cast-array-with-nested-struct-to-string/43376695#43376695 - basically just add `.select($"events.*")` after the toDF call. But that will not be dynamically renaming columns. FYI, better: val namesJson = spark.read.json(sc.wholeTextFiles("/user/jx665240/events.json").values) from https://stackoverflow.com/questions/38545850/read-multiline-json-in-apache-spark – Garren S Aug 09 '17 at 17:00
  • If i do $"events.*" i will get all the columns ., which is not i am expecting. Basically i will get _update_date,eventKey,depth2Name,journey_start if i do select *., i want to get _update_date,eventKey,eventValue ( event value will have values from depth2Name and Journey_Start) . Can you please have a look at the expected output and json i have provided and let me know if you need any info. Again thank you for looking into this – sree Aug 09 '17 at 19:49

0 Answers0