When I look for ways to parse JSON inside a string column of a DataFrame, I keep finding results that simply read JSON file sources instead. My source is actually a Hive ORC table with one column whose string values are in a JSON format, and I'd really like to convert those into something parsed, like a map.
I'm having trouble finding a way to do this:
import java.util.Date
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSON
val items = sql("select * from db.items limit 10")
//items.printSchema
val internal = items.map {
  case Row(externalid: Long, itemid: Long, locale: String,
           internalitem: String, version: Long,
           createdat: Date, modifiedat: Date) =>
    JSON.parseFull(internalitem)
}
I thought this would work, but maybe there's a more Spark-native way of doing it, because I get the following error:
java.lang.ClassNotFoundException: scala.Any
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
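If it helps narrow things down, my guess is that the Any comes from JSON.parseFull returning Option[Any], which Spark has no Encoder for. A sketch of the kind of thing I'd expect to at least run (column name assumed, spark-shell's import spark.implicits._ in scope), even though it throws away the structure and only grabs one field:

import scala.util.parsing.json.JSON

// Rough sketch only: map the JSON string column to a concrete String
// (here just the "name" field) so Spark has an Encoder, instead of the
// Any that JSON.parseFull's Option[Any] leaves behind. In practice
// parseFull probably returns None anyway given the single quotes.
val names = items
  .select("internalitem")
  .as[String]
  .map { raw =>
    JSON.parseFull(raw) match {
      case Some(m: Map[String, Any] @unchecked) => m.getOrElse("name", "").toString
      case _ => ""
    }
  }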
Specifically, my input data looks approximately like this:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{'name':'thing','attr':{
21:{'attrname':'size','attrval':'big'},
42:{'attrname':'color','attrval':'red'}
}}", 1, 2017-05-05…, 2017-05-06…
Yes, it's not exactly RFC 7158-compliant. The attr keys can be 5 to 30 of any of 80,000 values, so I wanted to get to something like this instead (my rough attempt at the conversion follows the sample):
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{"name':"thing","attr":[
{"id":21,"attrname':"size","attrval":"big"},
{"id":42,"attrname":"color","attrval":"red"}
]}", 1, 2017-05-05…, 2017-05-06…
Then flatten the internalitem into fields and explode the attr array (what I picture for that is sketched after the sample):
externalid, itemid, locale, name, attrid, attrname, attrval, version, createdat, modifiedat
123, 321, "en_us", "thing", 21, "size", "big", 1, 2017-05-05…, 2017-05-06…
123, 321, "en_us", "thing", 42, "color", "red", 1, 2017-05-05…, 2017-05-06…
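And for that last step, what I picture (assuming Spark 2.1+'s from_json, the normalized DataFrame from the sketch above, and a schema I'm guessing at) is roughly this:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// My guess at a schema for the normalized internalitem string.
val itemSchema = new StructType()
  .add("name", StringType)
  .add("attr", ArrayType(new StructType()
    .add("id", LongType)
    .add("attrname", StringType)
    .add("attrval", StringType)))

// Parse the string column, explode the attr array, then flatten.
val flattened = normalized
  .withColumn("parsed", from_json(col("internalitem"), itemSchema))
  .withColumn("attr", explode(col("parsed.attr")))
  .select(
    col("externalid"), col("itemid"), col("locale"),
    col("parsed.name").as("name"),
    col("attr.id").as("attrid"),
    col("attr.attrname").as("attrname"),
    col("attr.attrval").as("attrval"),
    col("version"), col("createdat"), col("modifiedat"))

Is there a more built-in way to get from the raw string to those exploded rows than this quote-fixing UDF?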