
I have a PairRDD of keys and lists of values. Each value in a list is a JSON object that I already loaded at the beginning of my Spark app. How can I iterate over each value of the list in my PairRDD to transform it to a string, then save the whole content of each key to a file?

My input files look like:

{cat:'red',value:'asd'}
{cat:'green',value:'zxc'}
{cat:'red',value:'jkl'}

The PairRDD looks like:

('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
('green', [{cat:'green',value:'zxc'}])

As you can see, I'd like to serialize each JSON object in the value list back to a string so I can easily call saveAsTextFile(). Of course, I'm trying to save a separate file for each key.

The way I got here:

import json

# Load the raw file; each line holds one JSON record
rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
categories = categoriesJson

# Key each record by its category, then group the records per key
catByDate = categories.map(lambda x: (x['cat'], x))
catGroup = catByDate.groupByKey()
catGroupArr = catGroup.mapValues(lambda x: list(x))

Ideally I want to create a cat-red.txt that looks like:

{cat:'red',value:'asd'}
{cat:'red',value:'jkl'}

and the same for the rest of the keys.
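
To make the goal concrete, here is a minimal sketch of the per-key serialization step in plain Python, without Spark. The helper names serialize_records and write_group are hypothetical, and the groups list is only a stand-in for what catGroupArr.collect() might return, which would only be reasonable if the grouped data fits on the driver. Note that json.dumps emits standard JSON with double-quoted keys, so the output would not match the shorthand notation above character for character.

```python
import json
import os
import tempfile


def serialize_records(records):
    """Turn a list of parsed JSON objects into newline-delimited JSON text."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def write_group(key, records, out_dir):
    """Write one key's records to <out_dir>/cat-<key>.txt and return the path."""
    path = os.path.join(out_dir, "cat-%s.txt" % key)
    with open(path, "w") as f:
        f.write(serialize_records(records) + "\n")
    return path


# Stand-in for catGroupArr.collect() (assumption: small enough for the driver)
groups = [
    ("red", [{"cat": "red", "value": "asd"}, {"cat": "red", "value": "jkl"}]),
    ("green", [{"cat": "green", "value": "zxc"}]),
]

# Temporary local directory used only for this illustration
out_dir = tempfile.mkdtemp()
paths = [write_group(key, records, out_dir) for key, records in groups]
```

The same serialize_records function could presumably be passed to catGroupArr.mapValues(...) to get one string per key inside Spark; how to then route each key to its own output file is the part I am unsure about.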

I already looked at this answer, but I'm slightly lost as to how to process each value in the list before I save the contents to a file.

Thanks in advance!

perrohunter