Selecting and grouping dictionary entries from json dictionary RDD by key using spark python

Question

I am using Spark with Python (PySpark).

I have an RDD, created from a SparkContext, composed of JSON objects, which are dictionaries. I'd like to select specific key/value pairs from each entry (JSON object) in the RDD, group them, and then collect them.

For example: Each entry in the RDD contains many (key:value) pairs, among these,

the first entry contains:  'str_id' : 000000 ,'text' : "text here"
the second entry contains: 'str_id' : 000001 ,'text' : "new text"

...

I'd like to collect the 'str_id' and 'text' values from each entry in the RDD together, to create a new RDD containing the following entries:

[(000000, "text here"), (000001, "new text"),...]

Unfortunately, I cannot figure out how to map these key:value pairs, because the dictionary key:value pairs are inside each RDD entry.

Any help with this would be appreciated

Edit: Resolved

I wanted to work within the RDD system because I am working with a large amount of data, which is why I didn't use .collect().

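# Read the file as an RDD of lines (one JSON string per line)
rdd = sc.textFile("./json-data.txt")

# Parse each line, then pull the wanted values out of each dictionary
rdd_entry = rdd.map(lambda x: jform(x)) \
               .map(lambda y: val_get(y, "text", "user"))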

Where val_get() is a function that returns dictionary entries combined in a tuple, and jform() converts strings to json objects.

I realized that the errors came from not filtering out the loose, non-JSON objects that got past the first mapping. I had originally thought that mapping over a dictionary entry in an RDD wouldn't work, but I was mistaken.
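The post does not show the bodies of jform() and val_get(), so the following is only a minimal sketch of what they might look like, with the filtering step described above. The function bodies, the select_pairs() helper and the sample lines are assumptions for illustration. The helpers are plain Python, so they can be tried locally on a list before being passed to rdd.map() and rdd.filter().

```python
import json

def jform(line):
    """Parse one line of text as JSON; return None when it is not a JSON object."""
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return obj if isinstance(obj, dict) else None

def val_get(entry, *keys):
    """Return the values for the given keys as a tuple (None for a missing key)."""
    return tuple(entry.get(k) for k in keys)

def select_pairs(lines, *keys):
    """Hypothetical local stand-in for the RDD pipeline: parse, drop non-JSON, select."""
    parsed = (jform(line) for line in lines)
    return [val_get(d, *keys) for d in parsed if d is not None]

# The same steps on an RDD, assuming a SparkContext named `sc`:
# rdd_entry = (sc.textFile("./json-data.txt")
#                .map(jform)
#                .filter(lambda d: d is not None)
#                .map(lambda d: val_get(d, "text", "user")))

# Made-up sample lines, including one that is not JSON
sample = [
    '{"text": "text here", "user": "a"}',
    'not json',
    '{"text": "new text", "user": "b"}',
]
result = select_pairs(sample, "text", "user")
```

Returning None from the parser and filtering afterwards keeps a single bad line from failing the whole job, at the cost of silently dropping it.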

Thanks

Welcome to stack overflow. Your question is a bit unclear and it would be helpful if you could provide a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples), ideally with code that users can cut-and-paste to recreate a small sample of your data. — pault, Oct 15 '18 at 20:51

score 0 · Answer 1 · answered Oct 15 '18 at 20:51

Your use case is not entirely clear to me, but you can obtain the expected output with something like the following:

>>> rdd = sc.parallelize([{'str_id':'000000' ,'text':'text here'},{'str_id':'000001' ,'text':'new text'}])
>>> rdd.collect()
[{'str_id': '000000', 'text': 'text here'}, {'str_id': '000001', 'text': 'new text'}]

>>> [tuple(k.values()) for k in rdd.collect()]
[('000000', 'text here'), ('000001', 'new text')]
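Note that tuple(k.values()) returns every value in each dictionary, in the dictionary's own order, and the list comprehension runs on the driver after collect(). When entries carry more keys than the two of interest, as the question describes, selecting the keys by name may be safer. A minimal sketch using operator.itemgetter follows. The extra 'lang' key is a made-up example, and the commented RDD line assumes the `rdd` defined above.

```python
from operator import itemgetter

# Callable returning (entry['str_id'], entry['text']); raises KeyError if a key is missing
pick = itemgetter("str_id", "text")

# Sample entries with a hypothetical extra key that should be ignored
entries = [
    {"str_id": "000000", "text": "text here", "lang": "en"},
    {"str_id": "000001", "text": "new text", "lang": "en"},
]
pairs = [pick(e) for e in entries]

# The same selection done inside the RDD, without collecting first:
# pairs_rdd = rdd.map(lambda e: (e["str_id"], e["text"]))
```

Keeping the selection inside a map() leaves the result as an RDD, so collect() is only needed if the pairs must be brought back to the driver.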
