
My sentence is, say, "I want to remove this string so bad." I passed this text file as

text = sc.textFile(...)

and I want to filter out (i.e., remove) the word "string". I noticed that in Python there is a "re" package. I tried doing

RDD.map(lambda x: x.replaceAll("<regular expression>", ""))

to filter out the "string", but it seems there is no such function in PySpark, because it gave me an error. How do I import the "re" package? Or is there any other function I can use to remove/filter out a certain string based on a regular expression in PySpark?

kys92

3 Answers


You can simply import the re package and apply re.sub to each line of the RDD, as shown below.

import re

text = sc.textFile(...)

# re.sub works on strings, so apply it to each line of the RDD via map
out = text.map(lambda line: re.sub("string", "", line))
print(out.collect())
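
If you want the cleaned text written back to disk rather than collected to the driver, you can save the result (the output path here is just a hypothetical example):

# writes one part file per partition; the path is a hypothetical example
out.saveAsTextFile("cleaned_output")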
Sahil Desai
  • Looks like a working solution. Maybe you can add some comments to your answer to make it more personal and useful. – picsoung Oct 26 '17 at 21:31

I'm not sure about any Spark-specific handling of text, but a general way to do it (for any kind of RDD) is to use the .map() method.

For example:

RDD.map(lambda s: s.replace("string", ""))
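
For instance, applied to the sentence from the question (a minimal sketch; sc is assumed to be an existing SparkContext):

rdd = sc.parallelize(["I want to remove this string so bad."])
cleaned = rdd.map(lambda s: s.replace("string", ""))
print(cleaned.collect())  # ['I want to remove this  so bad.']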
AlexM
  • can "string" inside .replace() be in the form of a regular expression? – kys92 Oct 26 '17 at 15:14
  • According to [this thread](https://stackoverflow.com/questions/11475885/python-replace-regex), replace cannot do it, but it can be done using re. – AlexM Oct 26 '17 at 15:22

To use re.sub on the contents of a text file, use a lambda function on each line in the file:

import re

rdd_sub = rdd.map(lambda line: re.sub("<regexpattern>", "<newvalue>", line))
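
For the question's case, a concrete version might look like this (a sketch assuming text is the RDD loaded with sc.textFile as in the question; the \b word boundaries remove only the standalone word "string"):

# remove the whole word "string" from every line
rdd_sub = text.map(lambda line: re.sub(r"\bstring\b", "", line))
print(rdd_sub.collect())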

devale