Removing Characters from Python Output

Question

I have spent a lot of time trying to remove characters such as u, u', u", [, (, ), / and ' from my Spark Python output. They are causing problems for my further processing, so any help with this would be appreciated.

I have input like this:

(u"(u'[25145,   12345678'", 0.0)
(u"(u'[25146,   25487963'", 43.0)

When I apply my code to sum the results, I get output like this:

(u'(u"(u\'[54879,    5125478\'"', 0.0)
(u"(u'[25145,   25145879'", 11.0)
(u'(u"(u\'[56897,    22548793\'"', 0.0)

So I want to remove all the characters like (u'(u"(u\'["'').

I want output like this:

54879,5125478,0.0

25145,25145879,11.0

The code I tried is:

from pyspark import SparkContext
import os
import sys

sc = SparkContext("local", "aggregate")

file1 = sc.textFile("hdfs://localhost:9000/data/first/part-00000")
file2 = sc.textFile("hdfs://localhost:9000/data/second/part-00000")

file3 = file1.union(file2).coalesce(1).map(lambda line: line.split(','))

result = file3.map(lambda x: ((x[0]+', '+x[1],float(x[2][:-1])))).reduceByKey(lambda a,b:a+b).coalesce(1)

result.saveAsTextFile("hdfs://localhost:9000/Test1")

This code aggregates the results by key. The output is correct, but it contains characters such as u, u', u", [, (, ), / and ' that I want to remove. The output looks like (u'(u"(u\'[54879, 5125478\'"', 0.0) (u"(u'[25145, 25145879'", 11.0). I want to remove all of these characters and get output like 54879,5125478,0.0 and 25145,25145879,11.0 – Deepak Patil, Nov 30 '15 at 13:33

mgaido · Accepted Answer · 2015-11-30T16:18:28.507

I think your only problem is that you have to reformat your result before saving it to the file, i.e. something like:

result.map(lambda x:x[0]+','+str(x[1])).saveAsTextFile("hdfs://localhost:9000/Test1")
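To illustrate why this helps: `saveAsTextFile` writes the string form of each element, and for a tuple that is its `repr`, including parentheses and quotes (plus the `u` prefix on Python 2). Below is a minimal sketch in plain Python, without Spark, of the same formatting step; the helper name `format_record` is only an illustration and not part of the original code.

```python
def format_record(record):
    """Turn a (key, total) pair into plain 'key,total' text."""
    key, total = record
    # format() converts the float to text, so no TypeError from str + float.
    return "{},{}".format(key, total)


pair = ("25145, 25145879", 11.0)
print(str(pair))            # tuple repr: parentheses and quotes are included
print(format_record(pair))  # 25145, 25145879,11.0
```

Note that this only changes how each pair is written out. Any stray characters already inside the key are kept as they are.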

edited Nov 30 '15 at 16:18

answered Nov 30 '15 at 13:56


Thanks Mark, but it gives me an error: `result.map(lambda x:x[0]+','+x[1]).saveAsTextFile("hdfs://localhost:9000/Test1")` TypeError: coercing to Unicode: need string or buffer, float found – Deepak Patil Nov 30 '15 at 15:34
This is because `x[1]` is a float: you need to convert it to string. I've updated the answer accordingly – mgaido Nov 30 '15 at 16:18
But are those characters in the input too? – mgaido Dec 01 '15 at 08:12
Yes Mark, the input file contains (u"(u'[25145, 12345678'", 0.0) (u"(u'[25146, 25487963'", 43.0). When I apply my code, it gives output like (u'(u"(u\'[54879, 5125478\'"', 0.0) (u"(u'[25145, 25145879'", 11.0). I'm wondering where the / characters are coming from, and I also want to remove all of these characters – Deepak Patil Dec 01 '15 at 08:34
import string
s = "@John, It's a fantastic #week-end%, How about () you"
for c in "!@#%&*()[]{}/?<>":
    s = string.replace(s, c, "")
print s
How can I use this in my code? – Deepak Patil Dec 01 '15 at 08:43
If you just want a quick-and-dirty solution like the code in your comment, you just need to apply it in a `map` call right after the `textFile` call. – mgaido Dec 01 '15 at 09:00
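As a rough sketch of that suggestion: instead of deleting characters one by one, each input line can be parsed with a regular expression that keeps only the numbers. The extra backslashes in the output most likely come from Python escaping the quotes already present in the key when it prints the tuple, so cleaning the lines before building the key avoids them. The names `parse_line` and `to_csv` are hypothetical, and the sketch assumes every line holds exactly two integer ids followed by one plain decimal value (no scientific notation).

```python
import re

# Optional minus sign, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_line(line):
    """Extract ((id1, id2), value) from a line such as
    (u"(u'[25145,   12345678'", 0.0), ignoring quotes, brackets and u prefixes.
    Raises ValueError if the line does not contain exactly three numbers."""
    first, second, value = NUMBER.findall(line)
    return (first, second), float(value)


def to_csv(record):
    """Format ((id1, id2), total) as 'id1,id2,total'."""
    (first, second), total = record
    return "{},{},{}".format(first, second, total)


sample = r'''(u'(u"(u\'[54879,    5125478\'"', 0.0)'''
print(to_csv(parse_line(sample)))  # 54879,5125478,0.0

# Possible use in the job from the question (assumed wiring, not run here):
# result = (file1.union(file2)
#           .map(parse_line)
#           .reduceByKey(lambda a, b: a + b)
#           .map(to_csv))
# result.saveAsTextFile("hdfs://localhost:9000/Test1")
```

Keeping the key as a tuple of the two ids means `reduceByKey` groups on clean values, and the final `map` writes plain comma-separated lines.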
