I'm new to Spark and I'm trying to make sense of how it works. I'm trying to create a table (DataFrame) whose columns are based on the JSON file I read. What I've done is:
val example = mysession.read.json("/FileStore/tables/bm2982090/exampleLog_dp_api-fac53.json")
As the documentation says, this returns a DataFrame. The DataFrame now looks like this:
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
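For what it's worth, printing the schema shows that Spark did infer the nested structure of `_source`; it just keeps it as a single struct column (a sketch of what I did to check, using the `example` DataFrame from above):

```scala
// Inspect the schema Spark inferred from the JSON file.
// _source shows up as a single struct column whose nested
// fields (request, svc, cipher_info, ...) are all inside it.
example.printSchema()
```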
and as you can see, it does a good job creating the columns _id, _index,
etc., BUT when it comes to _source
it puts everything inside one column. This happens because the file I've read looks like this:
{
  "_index":"dp_api-2017.06.22",
  "_type":"DPAPI",
  "_id":"AVzO9dqvoaL5S78GvkQU",
  "_score":1,
  "_source":{
    "request":{
      "user_ip":"ip",
      "method":"POST",
      "user_agent":"agent",
      "ts":"2017-06-22T10:40:37.420651+02:00"
    },
    "svc":"example",
    "cipher_info":{
      "userdata":[
        "field:parameters.username,start:0,end:0"
      ]
    },
    "message":"{\"class\":\"DPAPI\",\"event\":\"druid.auth.login\",\"operation\":{\"result\":{\"code\":200,\"description\":\"Success\",\"http_status\":200}},\"parameters\":,\"request\":{\"app_instance\":\"e83f99ff-a768-44d2-a448-9b51a535183f-1498034708\",\"end_point\":\"/oken\",\"method\":\"POST\",\"ts\":\"2017-06-22T10:40:37.420651+02:00\",\"user_agent\":\"Dr.1/Sa-SM-N\"},\"svc\":\"c-1\",\"ts\":\"2017-06-22T10:40:37.420614+02:00\"}",
    "type":"DPAPI",
    "tags":[
      "parse_to_json"
    ],
    "index_name":"dp_api",
    "class":"DPAPI",
    "operation":{
      "result":{
        "code":200,
        "description":"Success",
        "http_status":200
      }
    },
    "parameters":{
      "username":"Xp3opAyI0udKPuSQq5gqkQ=="
    },
    "event_age":0.37999987602233887,
    "ts":"2017-06-22T10:40:37.420614+02:00"
  }
}
So what I think is that when it comes to nested values, the process is "lazy" and it doesn't create the columns correctly. My question is: is it possible to extract the content of _source
before creating the DataFrame? Doing it after the DataFrame is created is also fine, but the output should be a table with as many columns as there are keys.
For example :
+--------------------+-----------------+------+-----------------------+-----+
| _id| _index|_score| _source.request.user_ip|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|ip |DPAPI|
+--------------------+-----------------+------+--------------------+-----+
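What I imagine is something along these lines, selecting the nested fields out of the struct, but I'm not sure it's the right approach (just a sketch, assuming the `example` DataFrame from above; the alias name is my own choice):

```scala
import org.apache.spark.sql.functions.col

// Sketch: pull individual nested fields of _source up to
// top-level columns using dotted column paths.
val flattened = example.select(
  col("_id"),
  col("_index"),
  col("_score"),
  col("_source.request.user_ip").alias("user_ip"),
  col("_type")
)
flattened.show()

// Or, to expand every field of _source into its own column at once:
val expanded = example.select(col("_id"), col("_source.*"))
expanded.show()
```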
I hope I explained it well! Thanks.