I am using Spark, coding in Python.
I have an RDD (created from a SparkContext) composed of JSON objects, which are dictionaries. I'd like to select specific key/value pairs from each entry (JSON object) in the RDD, group them together, and then collect them.
For example: each entry in the RDD contains many key:value pairs; among these,
the first entry contains: 'str_id': '000000', 'text': "text here"
the second entry contains: 'str_id': '000001', 'text': "new text"
...
I'd like to collect the 'str_id' and 'text' values from each entry in the RDD together, to create a new RDD containing the following entries:
[('000000', "text here"), ('000001', "new text"), ...]
Unfortunately I cannot figure out how to map these key:value pairs, because the dictionary key:value pairs are inside each RDD entry.
Any help with this would be appreciated.
Edit: Resolved
I wanted to work within the RDD system because I am working with a large amount of data, which is why I didn't use .collect().
rdd = sc.textFile("./json-data.txt")
rdd_entry = rdd.map(lambda x: jform(x)) \
               .map(lambda y: val_get(y, "text", "user"))
Where val_get() is a function that returns the selected dictionary values combined in a tuple, and jform() converts strings to JSON objects (dictionaries).
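For reference, a minimal sketch of what these two helpers might look like (jform and val_get are my own names, not Spark built-ins; this assumes each line of json-data.txt holds one JSON object):

import json

def jform(line):
    # Parse one line of text into a dict; return None for lines that aren't valid JSON.
    try:
        return json.loads(line)
    except ValueError:
        return None

def val_get(d, *keys):
    # Pull the requested keys out of a parsed entry and combine them into a tuple.
    return tuple(d.get(k) for k in keys)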
I realized that the errors I was getting were caused by not filtering out the loose, non-JSON lines that got past the first mapping. I had originally thought that mapping from a dictionary entry in an RDD wouldn't work, but I was mistaken.
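With jform() returning None for those loose lines, the fix is just a .filter() between the two maps (a sketch under that assumption):

rdd_entry = rdd.map(jform) \
               .filter(lambda d: d is not None) \
               .map(lambda d: val_get(d, "text", "user"))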
Thanks