
My sentence is, say, "I want to remove this string so bad." I passed this text file as

text = sc.textFile(...)

and I want to filter out (i.e., remove) the word "string". I noticed that in Python there is a "re" package. I tried doing

RDD.map(lambda x: x.replaceAll("<regular expression>", ""))

to filter out the "string", but it seems there is no such function in PySpark, because it gave me an error. How do I import the "re" package? Or is there any other function I can use to remove/filter out a certain string based on a regular expression in PySpark?

kys92

3 Answers


You can simply import the re package and apply re.sub to each line of the RDD, as shown below.

import re

text = sc.textFile(...)

# re.sub works on strings, so apply it to each line of the RDD via map
out = text.map(lambda line: re.sub("string", "", line))
print(out.collect())
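
If you want the cleaned text written back to disk rather than collected to the driver, you can save the result (the output path here is just a hypothetical example):

# writes one part file per partition; the path is a hypothetical example
out.saveAsTextFile("cleaned_output")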
Sahil Desai
  • Looks like a working solution. Maybe you can add some comments to your answer to make it more personal and useful. – picsoung Oct 26 '17 at 21:31

I'm not sure about any Spark-specific handling of text, but a general way to do it (for any kind of RDD) is to use the .map() method.

For example:

RDD.map(lambda s: s.replace("string", ""))
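
For instance, applied to the sentence from the question (a minimal sketch; sc is assumed to be an existing SparkContext):

rdd = sc.parallelize(["I want to remove this string so bad."])
cleaned = rdd.map(lambda s: s.replace("string", ""))
print(cleaned.collect())  # ['I want to remove this  so bad.']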
AlexM
  • can "string" inside .replace() be in the form of a regular expression? – kys92 Oct 26 '17 at 15:14
  • According to [this thread](https://stackoverflow.com/questions/11475885/python-replace-regex), replace cannot do it, but it can be done using re. – AlexM Oct 26 '17 at 15:22

To use re.sub on the contents of a text file, use a lambda function on each line in the file:

import re

rdd_sub = rdd.map(lambda line: re.sub("<regexpattern>", "<newvalue>", line))
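
For the question's case, a concrete version might look like this (a sketch assuming text is the RDD loaded with sc.textFile as in the question; the \b word boundaries remove only the standalone word "string"):

# remove the whole word "string" from every line
rdd_sub = text.map(lambda line: re.sub(r"\bstring\b", "", line))
print(rdd_sub.collect())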

devale