
When I try to read a Spark DataFrame column containing a JSON string as an array, using a defined schema, it returns null. I tried Array, Seq and List for the schema, but all of them return null. My Spark version is 2.2.0.

val dfdata = spark.sql("""select "[{ \"id\":\"93993\", \"name\":\"Phil\" }, { \"id\":\"838\", \"name\":\"Don\" }]" as theJson""")
dfdata.show(5,false)

val sch = StructType(
  Array(StructField("id", StringType, true),
      StructField("name", StringType, true)))
print(sch.prettyJson)
dfdata.select(from_json($"theJson", sch)).show

and the output:

+---------------------------------------------------------------+
|theJson                                                        |
+---------------------------------------------------------------+
|[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]|
+---------------------------------------------------------------+

{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "name",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
+----------------------+
|jsontostructs(theJson)|
+----------------------+
|                  null|
+----------------------+
Masterbuilder
  • Try dfdata.select("theJson").show and then you will get the data you are expecting. – abiratsis Mar 21 '19 at 19:38
  • @AlexandrosBiratsis, that gives only the original JSON string, what I am trying to do is, read the JSON data and split as individual columns like id and name – Masterbuilder Mar 21 '19 at 19:49
  • Possible duplicate of [How do I use a from_json() dataframe in Spark?](https://stackoverflow.com/questions/52945498/how-do-i-use-a-from-json-dataframe-in-spark) – abiratsis Mar 21 '19 at 19:56
  • I think the link above is the answer you are looking for :) – abiratsis Mar 21 '19 at 19:57
  • tried newDF.select ($"parsed.id",$"parsed.name").show(false), gave +----+----+ |id |name| +----+----+ |null|null| +----+----+ – Masterbuilder Mar 21 '19 at 20:06

2 Answers


Your schema isn't quite right for your example. Your example is an array of structs, so try wrapping the schema in an ArrayType:

val sch = ArrayType(StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
)))
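
With that schema in place, here is a minimal sketch of how the parsed array could be flattened into separate id and name columns. It reuses dfdata and sch from above and assumes the from_json overload that accepts a DataType (available from Spark 2.2):

import org.apache.spark.sql.functions.{explode, from_json}

// parse the JSON string into an array of structs, then emit one row per element
// (spark.implicits._ is assumed to be in scope, as in the question)
val parsed = dfdata
  .select(from_json($"theJson", sch).as("people"))
  .select(explode($"people").as("person"))
  .select($"person.id", $"person.name")

parsed.show(false)
// expected output, roughly:
// +-----+----+
// |id   |name|
// +-----+----+
// |93993|Phil|
// |838  |Don |
// +-----+----+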
Travis Hegner

Have you tried parsing your json string before obtaining a DF?

// obtaining this string should be easy:
val jsonStr = """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]"""

// then you can take advantage of schema inference
val df2 = spark.read.json(Seq(jsonStr).toDS)

df2.show(false)

// it shows:
// +-----+----+
// |id   |name|
// +-----+----+
// |93993|Phil|
// |838  |Don |
// +-----+----+
  • my input dataset is a CSV file where some columns are JSON and others are in other formats; I am not sure reading the whole file as JSON will help my requirement. – Masterbuilder Mar 21 '19 at 21:08
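
For the CSV case, a hedged sketch is to read the file normally and apply from_json only to the JSON column. The file path and the "header" option below are made up, and this assumes the JSON column is properly quoted in the CSV so its embedded commas don't split fields:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// hypothetical input: a CSV whose "theJson" column holds a JSON array alongside ordinary columns
val csvDf = spark.read
  .option("header", "true")
  .csv("/path/to/input.csv")

val sch = ArrayType(StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
)))

// only the JSON column is parsed; the remaining CSV columns are left untouched
val withParsed = csvDf.withColumn("parsed", from_json($"theJson", sch))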