
I am relatively new to Spark and Java programming. Given a JSON file with nested objects, I need to flatten its structure (i.e. denormalize the contents) and load it into Elasticsearch using Spark.

For instance,

if the contents of my example.json are:

{
  "title": "Nest eggs",
  "body":  "Making your money work...",
  "tags":  [ "cash", "shares" ],
  "comments": 
    {
      "name":    "John Smith",
      "comment": "Great article",
      "age":     28,
      "stars":   4,
      "date":    "2014-09-01"
    },
  "owner":
    {
      "name":    "John Smith",
      "age":     28
    }
}

I would want to restructure this into the format below and load it into ES using Spark.

{
  "title": "Nest eggs",
  "body":  "Making your money work...",
  "tags":  [ "cash", "shares" ],
  "comments_name": "John Smith",
  "comments_comment": "Great article",
  "comments_age":     28
  "comments_stars":   4,
  "comments_date":    "2014-09-01"
  "owner_name": "John Smith",
  "owner_age":     28,
 }

If one of the nested objects is empty, the corresponding flattened fields can be left empty as well.

Any help is appreciated. Thanks.

  • Semantically, there's no difference between `"comments.name"` (a field `name` nested inside a `comments` field) and `"comments_name"` (a top-level field) – Val Apr 28 '17 at 14:10
  • Agreed. It can be in either form. I want it to be flattened JSON before loading it into the index in ES. – fledgling Apr 28 '17 at 14:19

1 Answer


The answer you are looking for is here.

To summarize, you can simply select the nested fields you need via dot notation.

import sqlContext.implicits._   // needed for the $"column" syntax outside spark-shell

val df = sqlContext.read.json(json)
val flattened = df.select($"title", $"comments.name")