
My MongoDB document looks like this:

{  
   "_id":"sdf23sddfsd",
   "the_list":[  
      {  
         "Sentiment":[  
            "Negative",
            "Positive",
            "Positive"
         ]
      },
      {  
         "Sentiment":[  
            "Neutral",
            "Positive"
         ]
      }
   ],
   "some_other_list":[  
      {  
         "Sentiment":[  
            "Positive",
            "Positive",
            "Positive"
         ]
      }
   ]
}

I am trying to write a Spark/Java app to get the total count of each Sentiment from the_list and some_other_list:

// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

// Create a custom ReadConfig
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", "tmp");
//readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);

// Load data using the custom ReadConfig
JavaMongoRDD<Document> customRdd = MongoSpark.load(jsc, readConfig);

I tested the above, and it can get to the values perfectly fine by doing this:

System.out.println(((Document)((ArrayList)customRdd.first().get("the_list")).get(0)).get("Sentiments"));
//Prints [Negative, Positive, Positive]

But I am lost on how to aggregate the sentiment counts into something like this:

{  
   "_id":"sdf23sddfsd",
   "the_list":{  
      "Negative":1,
      "Positive":3,
      "Neutral":1
   },
   "some_other_list":{  
      "Positive":1
   }
}

I got this far, which is wrong because it only looks at index 0 of the_list:

    JavaRDD<String> sentimentsRDD= customRdd.flatMap(document -> ((Document)((ArrayList)document.get("the_list")).get(0)).get("Sentiments"));

I know we can do this in MongoDB directly, but I need to learn how to do it in Spark for structured data like this, so that I can apply the same approach to other use cases that require more manipulation of each document in a collection.

Watt
  • Why don't you use DataFrames, rather than sticking to RDDs (Spark's "assembler")? – Jacek Laskowski Jul 14 '17 at 01:12
  • Sure @JacekLaskowski, please feel free to suggest a solution using DataFrame. I thought that, since the JSON structure can have multiple layers of document embedding, DataFrame might not be a good fit here. But I might be wrong. – Watt Jul 14 '17 at 01:55
  • If DataFrame API's an option, would that answer help --> https://stackoverflow.com/q/44814926/1305344 ? – Jacek Laskowski Jul 14 '17 at 02:10
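
For reference, a rough, untested sketch of the DataFrame route suggested in the comments (assuming the connector can infer the nested schema via toDF()); note that it yields collection-wide totals rather than the per-document counts asked about above:

    // requires: import static org.apache.spark.sql.functions.*;
    Dataset<Row> df = customRdd.toDF();

    // the_list is an array of structs, so the_list.Sentiments is an array of arrays:
    // explode once to get each inner array, then once more to get individual strings.
    df.select(explode(col("the_list.Sentiments")).as("inner"))
      .select(explode(col("inner")).as("sentiment"))
      .groupBy("sentiment")
      .count()
      .show();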

0 Answers