
I need to remove non-printable characters from an RDD.

Sample data is below:

"@TSX•","None"
"@MJU•","None"

Expected output:

@TSX,None
@MJU,None

I tried the code below, but it's not working:

sqlContext.read.option("sep", ",") \
    .option("encoding", "ISO-8859-1") \
    .option("mode", "PERMISSIVE") \
    .csv(<path>) \
    .rdd.map(lambda s: s.replace("\xe2", ""))
LUZO

2 Answers


You can use the textFile function of sparkContext and string.printable to remove all non-printable characters from the strings.

import string

sc.textFile(<input path to csv file>) \
    .map(lambda x: ','.join([''.join(e for e in y if e in string.printable).strip('\"') for y in x.split(',')])) \
    .saveAsTextFile(<output path>)

Explanation

For your input line "@TSX•","None":
for y in x.split(',') splits the line into ["@TSX•", "None"], where y represents each element of the list while iterating
for e in y if e in string.printable checks whether each character of y is printable
the printable characters are then joined to form a string of printable characters
.strip('\"') removes the leading and trailing double quotes from the printable string
finally the list of strings is converted back to a comma-separated string by ','.join([''.join(e for e in y if e in string.printable).strip('\"') for y in x.split(',')])

I hope the explanation is clear enough to understand.
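
As a quick sanity check (plain Python, no Spark needed; the sample line is taken from the question), the inner expression behaves like this:

import string

line = '"@TSX\u2022","None"'   # the bullet (\u2022) is not in string.printable, so it will be dropped

cleaned = ','.join(
    # keep only printable characters in each field, then strip the surrounding quotes
    ''.join(e for e in y if e in string.printable).strip('"')
    for y in line.split(',')
)
print(cleaned)   # @TSX,None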

Ramesh Maharjan
  • code is working but here i need to remove only non printable characters, i need special characters in this data. is there a way to do it? – LUZO Mar 20 '18 at 15:58
  • ERROR: .map(lambda x: ','.join([filter(lambda e: e in string.printable, y) for y in x.split(',')]))\ TypeError: sequence item 0: expected str instance, filter found – LUZO Mar 20 '18 at 16:19
  • Not working getting below output: , – LUZO Mar 20 '18 at 16:38
  • Its working perfectly fine :) can you please elaborate your code. it will be easy for me to understand each and every part – LUZO Mar 20 '18 at 17:39
  • I have explained as much as I could @LUZO – Ramesh Maharjan Mar 20 '18 at 17:54
  • Hello @RameshMaharjan, do you have similar approach in scala. My day is exhausted finding it but no benefit. Thanks. – Kanav Sharma Aug 31 '18 at 14:18
  • @KanavSharma in scala you can simply do `sc.textFile(input path ).map(_.split(",").map(x => x.replaceAll("^\"|\"$", "").replaceAll("[^\\x00-\\x7F]", "")).mkString(",")).saveAsTextFile(output path )` what it does is replace all non-ascii characters with empty and replace the beginning and ending inverted comma . Hope its helpful and you can upvote the answer if its helpful – Ramesh Maharjan Sep 01 '18 at 06:37
  • @RameshMaharjan, thanks. but what happens is that my unwanted character is being treated by spark as a newline character. So once I read the file using spark via creating rdds or dataframes, it is omitted by the replaceall function. – Kanav Sharma Oct 26 '18 at 12:21
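
For completeness, a rough PySpark equivalent of the Scala snippet in the comment above — a sketch assuming the same quoted CSV input, with the path placeholders left as placeholders:

import re

sc.textFile(<input path>) \
    .map(lambda line: ','.join(
        # drop the surrounding quotes, then replace every non-ASCII character with nothing
        re.sub(r'[^\x00-\x7F]', '', re.sub(r'^"|"$', '', field))
        for field in line.split(',')
    )) \
    .saveAsTextFile(<output path>)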

One option is to try to filter your text using string.printable:

import string
sqlContext.read\
    .option("sep", ",")\
    .option("encoding", "ISO-8859-1")\
    .option("mode", "PERMISSIVE")\
    .csv(<path>)\
    .rdd\
    .map(lambda s: filter(lambda x: x in string.printable, s))

Example

import string
rdd = sc.parallelize(["TSX•,None","MJU•,None", "!@#ABC,*()XYZ"])

print(rdd.map(lambda s: filter(lambda x: x in string.printable, s)).collect())
#['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']
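
One caveat: on Python 3, filter returns an iterator rather than a string, so you would wrap it in ''.join(...). A minimal sketch of the same example:

import string

rdd = sc.parallelize(["TSX•,None", "MJU•,None", "!@#ABC,*()XYZ"])

# Python 3: filter() yields an iterator, so join the surviving characters back into a string
print(rdd.map(lambda s: ''.join(filter(lambda x: x in string.printable, s))).collect())
# ['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']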


pault
  • I dont want to remove special characters. Is there a way ? – LUZO Mar 20 '18 at 16:20
  • @LUZO this solution **doesn't** remove special characters. `string.printable` includes all characters that can be printed. The `filter` statement removes only characters that are not found in `string.printable`. Please try it and let me know if it works. – pault Mar 20 '18 at 16:21
  • @LUZO I have updated the example I provided to show that special characters are retained with this solution. – pault Mar 20 '18 at 16:25
  • I have tried your code. i am getting mentioned error : AttributeError: Can't pickle local object '..' – LUZO Mar 20 '18 at 16:26
  • That seems unrelated to the non-printable characters issue. I suspect that there's more code you're using that you're not sharing. Can you post edit the question and add the full code? – pault Mar 20 '18 at 16:28
  • i have edited the sample data. i just want to remove non printable so i didnt included everything – LUZO Mar 20 '18 at 16:31