I am writing a Python MapReduce word count program. The problem is that the data contains many non-alphabetic characters. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using a regex, but I am not sure how to implement it. My attempt:

    def mapfn(k, v):
        print v
        import re, string
        pattern = re.compile('[\W_]+')
        v = pattern.match(v)
        print v
        for w in v.split():
            yield w, 1

I'm afraid I am not sure how to use the `re` library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string `v` (a line of a book) so that I get back the line without any non-alphanumeric characters.
Suggestions?
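
For reference, here is a minimal sketch of one way the pattern could be applied. The likely issue with the attempt is that `pattern.match(v)` only tests for a match at the very start of the string and returns a match object (or `None`), not a cleaned string, so the later `v.split()` fails. `pattern.sub(replacement, v)` instead returns a new string with every match replaced. Replacing with a space rather than an empty string is my own assumption here, chosen so that words separated only by punctuation do not get glued together. The import is kept inside the function as in the original, in case the MapReduce framework ships the function to workers on its own.

```python
def mapfn(k, v):
    # Imported inside the function, as in the original attempt.
    import re

    # [\W_]+ matches runs of characters that are not letters or digits
    # (\W alone would keep underscores, hence the explicit _).
    pattern = re.compile(r'[\W_]+')

    # sub() returns a new string; each run of non-alphanumeric
    # characters becomes a single space so word boundaries survive.
    cleaned = pattern.sub(' ', v)

    for w in cleaned.split():
        yield w, 1


# Hypothetical sample line, just to exercise the function.
pairs = list(mapfn(0, "Hello, world! It's a test_case."))
```

With this sample line, `pairs` holds one `(word, 1)` tuple for each of `Hello`, `world`, `It`, `s`, `a`, `test`, `case`. Note that the apostrophe splits "It's" into two tokens; if that is not wanted, the character class would need adjusting.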