
I have a DataFrame with a string column whose values are JSON documents. I want to convert this column into multiple columns based on the JSON structure. I can do this when I have the JSON schema, but in this case I don't have it.

Example:

Original Dataframe:

---------------------
|        json_string|
---------------------
|{"a":2,"b":"hello"}|
|   {"a":1,"b":"hi"}|
---------------------

After conversion/parsing:

--------------
|  a |     b |
--------------
|  2 |  hello|
|  1 |     hi|
--------------

I am using Apache Spark 2.1.1.

zero323
Clairton Menezes

1 Answer


If you do not have a predefined schema, the other option is to convert the column to an RDD[String] or Dataset[String] and load it as JSON, letting Spark infer the schema.

Here is how you can do it:

// needed for .toDS
import spark.implicits._

// convert to RDD[String]
val rdd = originalDF.rdd.map(_.getString(0))

// convert to Dataset[String]
val ds = rdd.toDS

Now load it as JSON:

val df = spark.read.json(rdd) // or spark.read.json(ds)

df.show(false)

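On Spark 2.2.0 and later, prefer json(ds), because json(rdd) is deprecated from that version on. On 2.1.x only the RDD overload is available, since json(Dataset[String]) was added in 2.2.0.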

@deprecated("Use json(Dataset[String]) instead.", "2.2.0")

Output:

+---+-----+
|a  |b    |
+---+-----+
|2  |hello|
|1  |hi   |
+---+-----+
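
For reference, the steps above can be put together into one runnable sketch. This is only an illustration under some assumptions: it targets Spark 2.2.0 or later (it uses the json(Dataset[String]) overload), it runs against a local SparkSession, and the object name JsonColumnToColumns and the helper parseJsonColumn are made up for this example, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonColumnToColumns {

  // Hypothetical helper (not a Spark API): reads the given string column
  // as JSON and returns a DataFrame whose schema Spark infers from the data.
  def parseJsonColumn(spark: SparkSession, df: DataFrame, column: String): DataFrame = {
    import spark.implicits._ // provides the Encoder[String] needed by .as[String]
    val jsonDs = df.select(column).as[String]
    spark.read.json(jsonDs) // json(Dataset[String]) needs Spark 2.2.0+
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only; assumption, adjust for your cluster
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-column-to-columns")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question
    val originalDF = Seq(
      """{"a":2,"b":"hello"}""",
      """{"a":1,"b":"hi"}"""
    ).toDF("json_string")

    val parsed = parseJsonColumn(spark, originalDF, "json_string")
    parsed.printSchema()
    parsed.show(false)

    spark.stop()
  }
}
```

Going through df.select(column).as[String] avoids the detour through an RDD. Keep in mind that schema inference has to scan the data once before the actual read.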
koiralo