EDIT: Sorry about the quality of my previous question; I hope this one is clearer. With my Spark application I'm loading a whole directory of JSON files like the following one:
{
  "type": "some_type",
  "payload": {
    "data1": {
      "id": "1"
    },
    "data2": {
      "id": "1"
    },
    "data3": {
      "id": "1"
    },
    "dataset1": [{
      "data11": {
        "id": "1"
      },
      "data12": {
        "id": "1"
      }
    }],
    "masterdata": {
      "md1": [{
        "id": "1"
      }, {
        "id": "2"
      }, {
        "id": "3"
      }],
      "md2": [{
        "id": "1"
      }, {
        "id": "2"
      }, {
        "id": "3"
      }]
    }
  }
}
into a DataFrame, which I save as a temp table in order to use it later. In this JSON, the fields under the "payload" node are always present, but the subnodes of "masterdata" are optional.
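For reference, the loading step looks roughly like this (just a sketch, assuming Spark 1.4+ with a HiveContext; the input path is a placeholder):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc is the SparkContext, e.g. from spark-shell
import hiveContext.implicits._        // needed later for rdd.toDF

// Load every JSON file in the directory into one DataFrame
val jsonData = hiveContext.read.json("/path/to/json/dir")
// Register it as a temp table so it can be queried later
jsonData.registerTempTable("jsonData")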
The next step is to create multiple DataFrames, one for each subnode of the JSON; e.g. the DataFrame data1 contains the data of node "data1" from all files and looks like a regular table with a single "id" column (a sketch of this step follows the list below).
After the first processing part my Spark state is as follows:
DataFrames:
data1(id),
data2(id),
data3(id),
data11(id),
data12(id),
md1(id),
md2(id)
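For a plain struct node such as data1, the extraction is just a projection plus a map; roughly like this (a sketch, with the column name taken from the JSON above):

val data1 = hiveContext.sql("SELECT `payload`.data1.id FROM jsonData").rdd
  .map(row => row.getString(0)) // pull the id out of each row
  .distinct
  .toDF("id")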
Here comes the problem: if one of the JSON files in the directory doesn't contain the md2 node, I can run neither show() nor collect() on the "md2" DataFrame, because it fails with a NullPointerException. I would understand this if all files were missing the md2 node, so the md2 DataFrame could not be created at all; but in this case I expect the md2 DataFrame simply not to have the data from the JSON file that is missing md2, while still containing the data from all the others.
Technical details:
To read the data from a nested node I'm using rdd.map and rdd.flatMap, then converting the result to a DataFrame with custom column names.
If I run the application when all files in the directory contain all nodes, everything works; but if a single file is missing the md2 node, the app fails upon .show() or .collect().
BTW, if the node exists but is empty, everything works fine.
Is there any way to make Spark support optional JSON nodes, or to handle missing nodes within rdd.map and rdd.flatMap?
I hope this is clearer than my previous question.
At @Beryllium's request, here are the RDD operations I'm using to build the md2 DataFrame:
val jsonData = hiveContext.sql("SELECT `payload`.masterdata.md2 FROM jsonData")
val data = jsonData.rdd
  .flatMap(row => row.getSeq[Row](0))
  .map(row => row.getString(row.fieldIndex("id")))
  .distinct
val dataDF = data.toDF("id")
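I suspect the NullPointerException comes from calling getSeq on rows where md2 is null (i.e. rows originating from files without that node). Would a null guard along these lines be a reasonable fix, or is there a cleaner way? Just a sketch, untested:

import org.apache.spark.sql.Row

val jsonData = hiveContext.sql("SELECT `payload`.masterdata.md2 FROM jsonData")
val data = jsonData.rdd
  .filter(row => !row.isNullAt(0))          // skip rows whose md2 column is null
  .flatMap(row => row.getSeq[Row](0))       // flatten the remaining md2 arrays
  .map(row => row.getString(row.fieldIndex("id")))
  .distinct
val dataDF = data.toDF("id")

Alternatively, maybe the null rows could already be filtered out in SQL, e.g. with WHERE `payload`.masterdata.md2 IS NOT NULL?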