
I have a file with several ordinary columns plus one column that contains a JSON string, shown below (a screenshot was also attached). Each row holds one such string under a column named Demo (not visible in the picture). The other columns are omitted from the picture because they are not of concern here.

[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

Please do not change the format of the JSON; it appears in the data file exactly as above, except that everything is on one line.

Each row has one such array under a column, say JSON. The objects are all on one line, inside an array. I would like to parse this column using Spark and access the value of each object inside. Please help.

What I want is the value of the key "value": my objective is to extract the "value" of each JSON object into a separate column.

I tried using get_json_object. It works for JSON string 1) below, but returns null for JSON string 2):

  1. {"key":"device_kind","value":"desktop"}
  2. [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

The code I tried is as below:

val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")

jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2) // prints desktop under a column named device_kind

val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")

jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2) // prints null, but desktop was expected under a column named device_kind
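As an aside, `get_json_object` can index into a JSON array, but the path syntax is `$[0].value`, with no dot before the bracket. A minimal sketch (assuming a local `spark` session is in scope):

```scala
import org.apache.spark.sql.functions.{col, get_json_object}

// Same array as above; note the path "$[0].value" rather than "$.[0].value".
val jsonDF2b = spark.range(1).selectExpr(
  """ '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")

jsonDF2b
  .select(get_json_object(col("jsonString"), "$[0].value") as "device_kind")
  .show(false)
```

This only extracts one element per path, though, so parsing the whole array with from_json is still the cleaner route.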

Next I wanted to use from_json, but I am unable to figure out how to build a schema for an array of JSON objects. All the examples I can find cover nested JSON objects, but nothing similar to the JSON string above.

I did find that in SparkR 2.2 from_json has a boolean parameter which, if set to true, handles this type of JSON string (i.e. an array of JSON objects), but that option is not available in Spark (Scala) 2.3.3.
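For reference, the schema for an array of such objects is just an ArrayType wrapping a StructType; a sketch (Spark 2.3 also accepts the equivalent DDL string directly in from_json):

```scala
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Each element is a {key, value} struct; the array wraps the struct type.
val demoSchema = ArrayType(StructType(Seq(
  StructField("key", StringType, nullable = true),
  StructField("value", StringType, nullable = true)
)))

// The same schema in DDL form, usable as the schema argument of from_json:
val demoSchemaDDL = "array<struct<key:string,value:string>>"
```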

To be clear, the input and expected output should be as below.

Input:

+------------------------------------------------------------------------+
|Demographics                                                            |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+

Expected output:

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+


Bishamon Ten
    What is your wanted output? Maybe this can help: https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes – Shaido Sep 17 '19 at 09:00
  • What does parse mean and access the value of each object? A bit confused. – Lamanus Sep 17 '19 at 12:35

2 Answers


Aleh, thank you for the answer. It works fine. I did the solution in a slightly different way because I am using Spark 2.3.3:

import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{col, expr, from_json}

val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))

val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
  .withColumn("country_code", expr("Demographics[1].value"))
  .withColumn("device_platform", expr("Demographics[2].value"))
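One caveat: indexing with `Demographics[0]`, `Demographics[1]`, ... assumes the keys always arrive in the same order. If that is not guaranteed, here is a sketch of an order-independent variant that still works on Spark 2.3, using explode and pivot (`jsonDF3` is the parsed DataFrame from above):

```scala
import org.apache.spark.sql.functions.{col, explode, first, monotonically_increasing_id}

// Tag each row, explode the array into one (key, value) row per element,
// then pivot the keys back into columns regardless of their position.
val wide = jsonDF3
  .withColumn("row_id", monotonically_increasing_id())
  .withColumn("kv", explode(col("Demographics")))
  .select(col("row_id"), col("kv.key").as("key"), col("kv.value").as("value"))
  .groupBy("row_id")
  .pivot("key", Seq("device_kind", "country_code", "device_platform"))
  .agg(first("value"))
```

Listing the pivot values explicitly avoids an extra pass over the data to discover the distinct keys.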
Bishamon Ten

If your column with JSON looks like this:

import spark.implicits._

val inputDF = Seq(
  ("""[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""),
  ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}]"""),
  ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}]""")
).toDF("Demographics")

inputDF.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|Demographics                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------+
|[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]|
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}] |
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}] |
+-------------------------------------------------------------------------------------------------------------------------+

you can try parsing the column in the following way:

val parsedJson: DataFrame = inputDF.selectExpr("Demographics", "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json")

val splitted = parsedJson.select(
  col("parsed_json").as("Demographics"),
  col("parsed_json").getItem(0).as("device_kind_json"),
  col("parsed_json").getItem(1).as("country_code_json"),
  col("parsed_json").getItem(2).as("device_platform_json")
)

val result = splitted.select(
  col("Demographics"),
  col("device_kind_json.value").as("device_kind"),
  col("country_code_json.value").as("country_code"),
  col("device_platform_json.value").as("device_platform")
)

result.show(false)

You will get the output:

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+
Aleh Pranovich
  • Hi Aleh, this way of transformation makes it difficult to query the relevant country code and device platform for a particular device kind, say desktop. I would like to form the columns device_kind, country_code, and device_platform and have their corresponding values in each row. – Bishamon Ten Sep 20 '19 at 08:13
  • Hi BishamonTen. I've edited the answer. Check if the solution is what you need. – Aleh Pranovich Sep 20 '19 at 10:03
  • Hi Aleh, may I know if you have experience with unit testing and integration testing of Spark code/applications? I have some code that I want to unit test. Can you help me? – Bishamon Ten Sep 21 '19 at 11:04
  • Hi Bishamon. If you share additional information, I can say how I can help. – Aleh Pranovich Sep 22 '19 at 08:52
  • Hi Aleh. I have posted my question at this link: https://stackoverflow.com/questions/58066394/spark-unit-test-i-have-one-function-that-unions-3-input-datasets-should-i-do. Please provide your valuable inputs. – Bishamon Ten Sep 24 '19 at 07:10