
I'd like to get all the words from a text, including words that contain Unicode characters, while splitting on hyphens, underscores, and any other non-alphanumeric characters.

That is, I want something like this:

>>> getWords('John eats apple_pie')
['John', 'eats', 'apple', 'pie']
>>> getWords(u'André eats apple-pie')
[u'André', u'eats', u'apple', u'pie']

With

import re
getWords = lambda text: re.compile(r'[A-Za-z0-9]+').findall(text)

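it works for the first example but not the second, because the character class is ASCII-only and stops at the accented character. The following does the opposite, because \w also matches the underscore: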

getWords = lambda text: re.compile(r'\w+', re.UNICODE).findall(text)
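
To make the two failures concrete, here is a small sketch run under Python 3 (where every str is Unicode and the u prefix is optional). The names ascii_words and unicode_words are only illustrative stand-ins for the two attempts above:

```python
import re

# Attempt 1: ASCII-only character class
ascii_words = lambda text: re.compile(r'[A-Za-z0-9]+').findall(text)

# Attempt 2: Unicode-aware \w, which also matches the underscore
unicode_words = lambda text: re.compile(r'\w+', re.UNICODE).findall(text)

print(ascii_words('André eats apple-pie'))   # ['Andr', 'eats', 'apple', 'pie']
print(unicode_words('John eats apple_pie'))  # ['John', 'eats', 'apple_pie']
```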
rumpel

1 Answer


You can use str.isalnum() instead of a regex in this case. Replace every non-alphanumeric character with a space, then split on whitespace:

getWords = lambda x: ''.join(i if i.isalnum() else ' ' for i in x).split()
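
As a sketch for comparison (not part of the original answer), the same approach can be written as a named function, alongside a regex variant that should behave the same way on these inputs. The function names are made up for illustration, and the code assumes Python 3. The pattern [^\W_] means "a word character that is not an underscore":

```python
import re

# Any Unicode word character except the underscore, one or more times.
# re.UNICODE is already the default for str patterns in Python 3.
WORD_RE = re.compile(r'[^\W_]+', re.UNICODE)

def get_words_regex(text):
    """Return runs of letters and digits, using a regex."""
    return WORD_RE.findall(text)

def get_words_isalnum(text):
    """Return runs of letters and digits, using str.isalnum()."""
    # Turn every non-alphanumeric character into a space, then split.
    return ''.join(ch if ch.isalnum() else ' ' for ch in text).split()

for sample in ('John eats apple_pie', 'André eats apple-pie'):
    print(get_words_regex(sample), get_words_isalnum(sample))
```

Both functions return ['John', 'eats', 'apple', 'pie'] for the first sample and ['André', 'eats', 'apple', 'pie'] for the second. Compiling the pattern once at module level avoids recompiling it on every call.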
Remi Guan