I'm new to Spark and I'm trying to make sense of how it works. I'm trying to create a table (DataFrame) whose columns are based on the JSON file I read. What I've done is:
val example = mysession.read.json("/FileStore/tables/bm2982090/exampleLog_dp_api-fac53.json")
As the documentation says, this returns a DataFrame. The DataFrame now looks like this:
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
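For what it's worth, printing the schema shows that Spark did infer the nested structure of `_source`; it just keeps it as a single struct column (a sketch of what I did to check, using the `example` DataFrame from above):

```scala
// Inspect the schema Spark inferred from the JSON file.
// _source shows up as a single struct column whose nested
// fields (request, svc, cipher_info, ...) are all inside it.
example.printSchema()
```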
and as you can see, it does a good job creating the columns _id, _index,
etc., BUT when it comes to _source
it puts everything inside one column. This happens because the file I've read looks like this:
{
  "_index":"dp_api-2017.06.22",
  "_type":"DPAPI",
  "_id":"AVzO9dqvoaL5S78GvkQU",
  "_score":1,
  "_source":{
    "request":{
      "user_ip":"ip",
      "method":"POST",
      "user_agent":"agent",
      "ts":"2017-06-22T10:40:37.420651+02:00"
    },
    "svc":"example",
    "cipher_info":{
      "userdata":[
        "field:parameters.username,start:0,end:0"
      ]
    },
    "message":"{\"class\":\"DPAPI\",\"event\":\"druid.auth.login\",\"operation\":{\"result\":{\"code\":200,\"description\":\"Success\",\"http_status\":200}},\"parameters\":,\"request\":{\"app_instance\":\"e83f99ff-a768-44d2-a448-9b51a535183f-1498034708\",\"end_point\":\"/oken\",\"method\":\"POST\",\"ts\":\"2017-06-22T10:40:37.420651+02:00\",\"user_agent\":\"Dr.1/Sa-SM-N\"},\"svc\":\"c-1\",\"ts\":\"2017-06-22T10:40:37.420614+02:00\"}",
    "type":"DPAPI",
    "tags":[
      "parse_to_json"
    ],
    "index_name":"dp_api",
    "class":"DPAPI",
    "operation":{
      "result":{
        "code":200,
        "description":"Success",
        "http_status":200
      }
    },
    "parameters":{
      "username":"Xp3opAyI0udKPuSQq5gqkQ=="
    },
    "event_age":0.37999987602233887,
    "ts":"2017-06-22T10:40:37.420614+02:00"
  }
}
So what I think is that when it comes to nested values, the process is "lazy" and it doesn't create the columns correctly. My question is: is it possible to extract the content of _source
before creating the DataFrame? Doing it after the DataFrame is created is also fine, but the output should be a table with as many columns as there are keys.
For example :
+--------------------+-----------------+------+-----------------------+-----+
| _id| _index|_score| _source.request.user_ip|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|ip |DPAPI|
+--------------------+-----------------+------+--------------------+-----+
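What I imagine is something along these lines, selecting the nested fields out of the struct, but I'm not sure it's the right approach (just a sketch, assuming the `example` DataFrame from above; the alias name is my own choice):

```scala
import org.apache.spark.sql.functions.col

// Sketch: pull individual nested fields of _source up to
// top-level columns using dotted column paths.
val flattened = example.select(
  col("_id"),
  col("_index"),
  col("_score"),
  col("_source.request.user_ip").alias("user_ip"),
  col("_type")
)
flattened.show()

// Or, to expand every field of _source into its own column at once:
val expanded = example.select(col("_id"), col("_source.*"))
expanded.show()
```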
I hope I explained it well! Thanks.