I have a very large .tsv file with a somewhat unusual structure; its rows look like this:
CA 110123140 14228123056896 [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}]
So, as you can see, it has 4 columns, but the 4th one is a JSON array of objects.
I can load the file into a DataFrame in Spark with:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load(file_path)
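For reference, spark-csv names the columns C0 through C3 by default, so I rename them right after loading (the first three names are my own, taken from the edit below):

val named = df.toDF("country", "user", "object", "issues")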
But this:
df.take(1)(0)(3)
yields:
res53: Any = [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}]
which is typed as Any, making it difficult (for me) to parse as JSON.
Ideally, what I would like to have is a dataframe whose columns are the keys of the JSON objects:
"id" "location" "severity" "timestamp"
123 blabla critical 2017-02-19T20:54:43Z
234 blabla critical 2017-02-19T21:54:43Z
So the difficulty is twofold:
- For each row in the original file, I may get several rows in the final dataframe, since each JSON value is actually a list of objects.
- How do I parse the JSON array itself? (One approach I'm considering is sketched below.)
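If I can move to Spark 2.2+, one approach that seems to address both points is from_json with an ArrayType schema plus explode; explode keeps the sibling columns on every generated row. The schema below is my guess from the sample row, and the column names come from my edit below:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Schema of the 4th column: an array of objects with these four string fields
val issuesSchema = ArrayType(StructType(Seq(
  StructField("id", StringType),
  StructField("location", StringType),
  StructField("severity", StringType),
  StructField("timestamp", StringType)
)))

val flat = named
  .withColumn("issue", explode(from_json(col("issues"), issuesSchema)))  // one row per array element
  .select(col("country"), col("user"), col("object"),
          col("issue.id"), col("issue.location"),
          col("issue.severity"), col("issue.timestamp"))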
EDIT:
I realize I wasn't very clear about what I actually want. What I would actually like is to also be able to access the first three columns, so that the final df looks like this:
"country " "user" "object" "id" "location" "severity" "timestamp"
CA 98172937 87647563 123 blabla critical 2017-02-19T20:54:43Z
CA 98172937 87647563 234 blabla critical 2017-02-19T21:54:43Z
This is the most difficult part, I believe, as it involves somehow attaching the information from the first 3 columns to every row produced from the JSON array.
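Since I'm still on sqlContext (Spark 1.x), a fallback I'm considering is to parse the JSON column myself with json4s (which ships with Spark) and flatMap each input row into one output row per array element; the Issue case class and the column names are my own assumptions:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import sqlContext.implicits._

case class Issue(id: String, location: String, severity: String, timestamp: String)

val flat1x = df.rdd.flatMap { row =>
  implicit val formats = DefaultFormats
  // The 4th column holds the JSON array; emit one tuple per element,
  // repeating the first three columns so they land on every output row
  parse(row.getString(3)).extract[List[Issue]].map { i =>
    (row.getString(0), row.getString(1), row.getString(2),
     i.id, i.location, i.severity, i.timestamp)
  }
}.toDF("country", "user", "object", "id", "location", "severity", "timestamp")

This would carry the first three columns into every generated row by construction, which is exactly the part I'm unsure how to do with DataFrame operations alone.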