I am new to Python and am trying to work on some big-data code, but I can't understand what the expression re.compile(r"[\w']+") means. Does anyone have any idea what it does?

This is the code that I'm using.

from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
lurker
Larry Paul
  • Look up "Python regular expressions" and read the Python documentation on regular expressions. You are compiling a regular expression, then using it to search for text that matches that regular expression. – lurker Jun 16 '18 at 14:16
  • That's what the documentation is for: https://docs.python.org/3/library/re.html#re-objects and https://docs.python.org/3/library/re.html#re.compile – Patrick Artner Jun 16 '18 at 14:19
  • @PatrickArtner I still don't get how exactly that r"[\w']+" part breaks the line into words. – Larry Paul Jun 16 '18 at 14:24
  • See Zev's explanation, and use http://regex101.com for testing regexes; you even get an explanation for any pattern you provide. I find it better than pythex, and it also has a `python` regex switch. – Patrick Artner Jun 16 '18 at 14:30
  • @PatrickArtner awesome resource! I've added that. – Zev Jun 16 '18 at 14:41
  • Whitespace doesn't match but alphanumerics and apostrophes do. So it breaks it up based on whitespace between the words. I added that to my answer. – Zev Jun 16 '18 at 14:49

2 Answers

This is a regular expression that has been compiled so that it can be reused without recompiling the pattern each time (explained in this question: Is it worth using re.compile). The function re.compile is explained in the Python docs.
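As a minimal sketch of what compiling buys you: the compiled object exposes the same search methods as the module-level functions, so the two calls below return the same list. The sample string and the variable names other than WORD_REGEXP are made up for illustration.

```python
import re

# Compile once, then reuse the pattern object as often as needed.
WORD_REGEXP = re.compile(r"[\w']+")

line = "How's it going"  # hypothetical sample input

# Method on the compiled pattern object.
compiled_result = WORD_REGEXP.findall(line)

# Equivalent module-level call that takes the pattern string each time.
direct_result = re.findall(r"[\w']+", line)

print(compiled_result == direct_result)
```

In the question's code the pattern is compiled once at module level and then reused for every line passed to the mapper.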

Regarding the specific pattern, it searches for runs of one or more characters (that's the +) where each character is either a word character (that's the \w part: letters, digits, and the underscore) or an apostrophe (which is also in those square brackets). Note that whitespace is not a match, and neither is most punctuation, so this, generally speaking, breaks a line into words.

See the query in a Python-specific regex tester to try it out, or on regex101, which offers an explanation of any regex.

In the phrase "How's it going $#" this would show three matches: How's, it, going. It wouldn't match the group of symbols.
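To check this yourself, here is a small sketch that runs the pattern from the question against that phrase. The second input string is made up, to show that underscores and digits count as word characters while hyphens and other punctuation act as separators.

```python
import re

WORD_REGEXP = re.compile(r"[\w']+")

# The phrase from the explanation above: the symbols "$#" are not matched.
matches = WORD_REGEXP.findall("How's it going $#")
print(matches)  # ["How's", 'it', 'going']

# Hypothetical second input: "_" and digits are part of \w, "-" and "," are not.
more = WORD_REGEXP.findall("snake_case, don't-stop 42!")
print(more)  # ['snake_case', "don't", 'stop', '42']
```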

There are lots of tutorials and even some games out there, but you can start with regexone and understand regexes better by trying some out.

Zev

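With the help of re.compile(r'\W') we can replace special characters in a string. \W matches any character that is not a word character (letter, digit, or underscore), so it also matches spaces.

Example:

import re

text = 'how many $ amount spend for Car??'
pattern = re.compile(r'\W')
x = re.sub(pattern, ' ', text)
print(repr(x))

Result:

'how many   amount spend for Car  '

Note: The special characters "$" and "?" are each replaced with a space, so they no longer appear in the string. The existing spaces also match \W and are replaced with spaces, which is why a run of three spaces and two trailing spaces are left behind.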
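If the goal is to drop the symbols entirely and end up with single spaces between words, one possible variant is sketched below. It is not part of the answer above: the character class [^\w\s] and the split/join step are assumptions about what "remove special characters" should mean here, and the variable names are illustrative.

```python
import re

text = 'how many $ amount spend for Car??'

# Assumption: a "special character" is anything that is neither a word
# character (\w) nor whitespace (\s). Delete those characters outright.
no_symbols = re.sub(r"[^\w\s]", "", text)

# Removing "$" leaves two adjacent spaces, so collapse whitespace runs.
cleaned = " ".join(no_symbols.split())

print(cleaned)  # how many amount spend for Car
```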

4b0
Hemang Dhanani