
I'm new to Apache Spark, and would like to take a dataset saved in JSON (a list of dictionaries), load it into an RDD, then apply operations like filter and map. This seems to me like it should be simple, but after looking around Spark's docs the only thing I found used SQL queries (https://spark.apache.org/docs/1.1.0/sql-programming-guide.html), which is not how I'd like to interact with the RDD.

How can I load a dataset saved in JSON into an RDD? If I missed the relevant documentation, I'd appreciate a link.

Thanks!

Brandt
    The same documentation says that using SQL is only one alternative: you can use jsonRDD to query your data in a hierarchical way. `val anotherPeopleRDD = sc.parallelize( """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil); val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)` – Ashalynd Jan 13 '15 at 20:29

2 Answers


You could do something like

import org.apache.spark.rdd.RDD
import org.json4s._
import org.json4s.native.JsonMethods._

// Assumes one JSON document per line; lines that fail to parse are silently dropped.
val jsonData: RDD[JValue] = sc.textFile(path).flatMap(line => parseOpt(line))

and then do your JSON processing on that JValue, like

jsonData.foreach { json =>
  println(json \ "someKey")
  (json \ "id") match {
    case JInt(x) => ???   // handle a numeric "id"
    case _       => ???   // handle a missing or non-numeric "id"
  }
}
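
Since the question asks specifically about filter and map, here is a minimal sketch of that style on the same jsonData RDD (reusing the imports above). The "id" and "name" fields are illustrative assumptions, not taken from the question's data:

// Illustrative sketch: "id" and "name" are assumed field names.
val withId: RDD[JValue] = jsonData.filter { json =>
  (json \ "id") match {
    case JInt(id) => id > 0   // keep records whose "id" is a positive integer
    case _        => false    // drop records without a numeric "id"
  }
}

// Project each remaining record down to its "name" field, rendered back to a JSON string.
val names: RDD[String] = withId.map(json => compact(render(json \ "name")))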
tgpfeiffer

Have you tried applying json.loads() in the map?

import json

lines = sc.textFile('/path/to/file')  # one JSON document per line
d = lines.map(lambda line: json.loads(line))
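
From there, filter and map work on the resulting RDD of dicts like on any other RDD. A minimal sketch, assuming one JSON object per line; the 'age' and 'name' keys are made up for illustration:

import json

lines = sc.textFile('/path/to/file')
records = lines.map(json.loads)

# Illustrative keys only; adjust to whatever fields your JSON actually has.
adults = records.filter(lambda r: r.get('age', 0) >= 18)
names = adults.map(lambda r: r['name'])
print(names.take(5))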
Aaron Bannin
  • Basically this is what I did, yes. I read the file as raw text, line by line, and apply the json.loads function. Thanks for your answer! – Brandt Jul 30 '15 at 22:46