
I'm new to Apache Spark, and would like to take a dataset saved in JSON (a list of dictionaries), load it into an RDD, then apply operations like filter and map. This seems to me like it should be simple, but after looking around Spark's docs the only thing I found used SQL queries (https://spark.apache.org/docs/1.1.0/sql-programming-guide.html), which is not how I'd like to interact with the RDD.

How can I load a dataset saved in JSON into an RDD? If I missed the relevant documentation, I'd appreciate a link.

Thanks!

Brandt
    The same documentation says that using SQL is only one alternative: you can use jsonRDD to query your data in a hierarchical way. `val anotherPeopleRDD = sc.parallelize( """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil); val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)` – Ashalynd Jan 13 '15 at 20:29

2 Answers


You could do something like

import org.apache.spark.rdd.RDD
import org.json4s._
import org.json4s.native.JsonMethods._

// Assumes one JSON document per line; lines that fail to parse are silently dropped.
val jsonData: RDD[JValue] = sc.textFile(path).flatMap(line => parseOpt(line))

and then do your JSON processing on that JValue, like

jsonData.foreach { json =>
  println(json \ "someKey")
  (json \ "id") match {
    case JInt(x) => ???   // handle a numeric "id"
    case _       => ???   // handle a missing or non-numeric "id"
  }
}
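
Since the question asks specifically about filter and map, here is a minimal sketch of that style on the same jsonData RDD (reusing the imports above). The "id" and "name" fields are illustrative assumptions, not taken from the question's data:

// Illustrative sketch: "id" and "name" are assumed field names.
val withId: RDD[JValue] = jsonData.filter { json =>
  (json \ "id") match {
    case JInt(id) => id > 0   // keep records whose "id" is a positive integer
    case _        => false    // drop records without a numeric "id"
  }
}

// Project each remaining record down to its "name" field, rendered back to a JSON string.
val names: RDD[String] = withId.map(json => compact(render(json \ "name")))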
tgpfeiffer

Have you tried applying json.loads() in the map?

import json

lines = sc.textFile('/path/to/file')  # one JSON document per line
d = lines.map(lambda line: json.loads(line))
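
From there, filter and map work on the resulting RDD of dicts like on any other RDD. A minimal sketch, assuming one JSON object per line; the 'age' and 'name' keys are made up for illustration:

import json

lines = sc.textFile('/path/to/file')
records = lines.map(json.loads)

# Illustrative keys only; adjust to whatever fields your JSON actually has.
adults = records.filter(lambda r: r.get('age', 0) >= 18)
names = adults.map(lambda r: r['name'])
print(names.take(5))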
Aaron Bannin
  • Basically this is what I did, yes. I read the file as raw text, line by line, and apply the json.loads function. Thanks for your answer! – Brandt Jul 30 '15 at 22:46