Following the Prepare my bigdata with Spark via Python question, I did:
import json

# parse the JSON payload in each record
data = data.map(lambda x: (x[0], json.loads(x[1])))
# emit one (item, key) pair per element of the parsed list
r = data.flatMap(lambda x: ((a, x[0]) for a in x[1]))
# group by item and keep at most 150 keys per group
r2 = r.groupByKey().map(lambda x: (x[0], tuple(x[1])[:150]))
r2.saveAsTextFile('foo')
Now when I read it back like:
data = sc.textFile('foo')
dataList = data.collect()
I get:
In [4]: dataList[8191][0]
Out[4]: u'('
In [5]: dataList[8191]
Out[5]: u"(5119, (u'5873341528', u'8599419601', u'5155716547'))"
In [6]: len(dataList[8191][1])
Out[6]: 1
In [7]: dataList[8191][1]
Out[7]: u'5'
However, the data were meant to be accessed with the first item being 5119 and the second being the tuple. I also see the unicode marker (u) there, which I didn't ask for.
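To illustrate what I think is happening, here is a minimal reproduction without Spark (the values are made up to match the output above). It seems saveAsTextFile writes the str() representation of each record, so textFile hands back plain strings, and indexing a string gives single characters:

```python
# Hypothetical record, mimicking one element of r2 (values made up)
row = (5119, ('5873341528', '8599419601', '5155716547'))

# Roughly what saveAsTextFile writes to the text file:
line = str(row)

# Reading back with textFile yields that string, not a tuple,
# so indexing returns characters:
print(line[0])       # '(' -- the first character, not 5119
print(line[1])       # '5' -- a single character, not the inner tuple
print(len(line[1]))  # 1
```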
Any ideas?