As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (which is already the default for str patterns in Python 3). There are small differences between these two character classes; see the link in comments.
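To make the replacement concrete, here is a minimal sketch of [^\W\d_] used as a stand-in for \p{L}. The sample string and the variable names are only illustrative assumptions, not taken from the question.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which leaves letters (close to \p{L}, but not strictly identical).
letters = re.compile(r'[^\W\d_]+', re.UNICODE)

sample = 'lélijk_42 kat'  # illustrative input
runs = letters.findall(sample)
print(runs)  # ['lélijk', 'kat']
```

The accented é is kept, while the underscore and the digits act as separators.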
Second point: your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find all the last words directly; see this example:
import re
s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''
p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)
words = p.findall(s)
print('\n'.join(words))
Notes:
To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
If you absolutely want to limit results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use. Alternatively, whichever module you choose, you can use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
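The letters-only variant mentioned in the notes can be sketched as follows. The LETTER helper name and the extra 'kamer 42' input are assumptions added for illustration; the pattern is otherwise the same one used with \w.

```python
import re

# Hypothetical helper name: the letters-only class standing in for \w.
LETTER = r'[^\W\d_]'

# Same last-word pattern as with \w, with each \w swapped for LETTER.
p = re.compile(r'^.*\b(?<!-)(' + LETTER + r'+(?:-' + LETTER + r'+)*)',
               re.M | re.U)

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

last_words = p.findall(s)
print('\n'.join(last_words))

# Illustrative input: a trailing number is skipped, so the last
# alphabetic word is returned instead.
print(p.findall('kamer 42'))  # ['kamer']
```

With the original \w version, the same 'kamer 42' input would give '42', so the choice depends on whether numbers should count as words.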
You can achieve the same with the regex module with this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
Other ways:
By line with the re module:
p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    # the appended space guarantees an empty final item, so [-2] is the last word
    print(p.split(line+' ')[-2])
With the regex module you can take advantage of the reversed search:
p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))