
I have a very large .tsv file with a somewhat strange structure; it has rows like this:

CA  110123140   14228123056896  [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}]

So, as you can see, it has 4 columns, but the 4th one is a JSON object.

I can load the file to a df on spark with:

val df = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load(file_path)

But this:

df.take(1)(0)(3)

yields:

res53:  Any = [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}]

which makes it (for me) difficult to parse as a json object.
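
(As the comment below points out, the fourth column is actually already read as a string; a minimal sketch of how to confirm that, assuming the load shown above:

 df.printSchema()                               // spark-csv reads every column as string by default
 val json: String = df.take(1)(0).getString(3)  // getString avoids the untyped Any returned by apply(3)

)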

Ideally, what I would like to have is a dataframe where the columns are the keys of the json object:

"id"    "location"   "severity"          "timestamp"
 123     blabla      critical       2017-02-19T20:54:43Z
 234     blabla      critical       2017-02-19T21:54:43Z

So the difficulty is two-fold.

  1. For each row on the original file, I may have several rows on the final dataframe, as every json object is a list.
  2. How do I process the json object?

EDIT:

I realize I wasn't very clear about what I actually want. What I would like is to also have access to the first three columns, so that the final df looks like this:

"country "   "user"    "object"  "id"    "location"   "severity"          "timestamp"
    CA      98172937   87647563  123     blabla      critical       2017-02-19T20:54:43Z
    CA      98172937   87647563  234     blabla      critical       2017-02-19T21:54:43Z

This is the most difficult part, I believe, as it involves somehow combining the information from the first 3 columns with the JSON object.

elelias
  • Your "json" is of type string and not Any, but the take method returned it without the inferred type. Check your data schema with df.printSchema – eliasah Apr 01 '17 at 10:01

1 Answer


You can read your data as an RDD and then transform the JSON column to a DataFrame like this:

 val rdd = sc.textFile("path/filet.tsv").map(_.split('\t')(3))  // keep only the 4th (JSON) column of each line
 val df = sqlContext.read.json(rdd)                             // let Spark infer the schema from the JSON strings
 df.printSchema
 root
  |-- id: string (nullable = true)
  |-- location: string (nullable = true)
  |-- severity: string (nullable = true)
  |-- timestamp: string (nullable = true)
Mehrez
  • Thanks a lot for that, very helpful. Would you mind taking a look at the EDIT? – elelias Apr 01 '17 at 14:55
  • In this case you can transform your string column to JSON with the from_json function; look at the original question suggested by @zero323 – Mehrez Apr 02 '17 at 16:22 (a sketch of that approach follows below)
  • still going to accept this as it was the right answer to the question I posted. – elelias Apr 02 '17 at 19:41
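
Following up on the from_json suggestion in the comments above, a minimal sketch of how the edited requirement (keeping the first three columns alongside the exploded JSON fields) could look. This assumes Spark 2.2+, where from_json accepts an ArrayType schema; it also assumes df is the four-column DataFrame loaded from the TSV in the question, and the column names country/user/object/json are only illustrative:

 import org.apache.spark.sql.functions.{col, explode, from_json}
 import org.apache.spark.sql.types._

 // hypothetical schema for one element of the JSON array in the 4th column
 val itemSchema = StructType(Seq(
   StructField("id", StringType),
   StructField("location", StringType),
   StructField("severity", StringType),
   StructField("timestamp", StringType)
 ))

 val result = df
   .toDF("country", "user", "object", "json")                          // rename the four TSV columns
   .withColumn("item", explode(from_json(col("json"), ArrayType(itemSchema))))
   .select(col("country"), col("user"), col("object"),
           col("item.id"), col("item.location"),
           col("item.severity"), col("item.timestamp"))

With the sample row from the question, result would contain two rows, one per element of the JSON array, each repeating the country/user/object values.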