How to match alphanumeric characters in python regexp?

Question

I’d like to extract all the words from a text. Words may contain Unicode letters and digits, but hyphens, underscores, and any other non-alphanumeric characters should be treated as separators.

I.e. I want something like this:

>>> getWords('John eats apple_pie')
['John', 'eats', 'apple', 'pie']
>>> getWords(u'André eats apple-pie')
[u'André', u'eats', u'apple', u'pie']

With

getWords = lambda text: re.compile(r'[A-Za-z0-9]+').findall(text)

it works for the first example but not the second (the é is not matched, so the name is cut short). It is the other way around with this, which keeps apple_pie as one word:

getWords = lambda text: re.compile(r'\w+', re.UNICODE).findall(text)
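To make the two failure modes concrete, here is a small sketch. It assumes Python 3, where `str` patterns are Unicode-aware by default, so `re.UNICODE` is not needed. The names `ascii_only` and `word_chars` are illustrative and not part of the question.

```python
import re

# ASCII-only character class: cannot match accented letters such as é.
ascii_only = re.compile(r'[A-Za-z0-9]+')

# \w matches Unicode letters and digits, but also the underscore.
word_chars = re.compile(r'\w+')

# 'Andr\u00e9' is 'André' written with an explicit escape for the é.
print(ascii_only.findall('Andr\u00e9 eats apple-pie'))
# ['Andr', 'eats', 'apple', 'pie']   (the é is dropped)

print(word_chars.findall('John eats apple_pie'))
# ['John', 'eats', 'apple_pie']      (the underscore is kept)
```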

@jonrsharpe: Nope, it’s finished I guess. Are you missing more information? — rumpel, Jan 09 '16 at 14:47
Ah, no; I was expecting another sentence at the end but now I see what you mean. — jonrsharpe, Jan 09 '16 at 14:47
Hi @rumpel the duplicated post answers your problem, AFAIK. If you think it doesn't please ping me back. — Bhargav Rao, Jan 09 '16 at 14:48
@Tushar, I want that to be generic. I.e. there are thousands of non-alphanumeric characters that I don’t want to have to list manually. — rumpel, Jan 09 '16 at 14:55
@BhargavRao: Thanks. The duplicate does indeed contain some solutions that should work, using manual parsing instead of regular expressions. Too bad Python’s regexps can’t deal with this. — rumpel, Jan 09 '16 at 14:59
@rumpel What makes you think `split` is not generic, when your delimiters are fixed and the text can contain any character? IMO, `split` will work on Chinese and Japanese characters. — Tushar, Jan 09 '16 at 15:00
@Tushar: I mean the delimiters aren’t fixed either. Could be some non-breaking space or – or … or ☺. So this list while fixed, would be very very long. — rumpel, Jan 09 '16 at 15:04

Remi Guan · Accepted Answer · 2016-01-09T15:18:00.427

You can use str.isalnum() instead of a regular expression in this case:

getWords = lambda x: ''.join(i if i.isalnum() else ' ' for i in x).split()
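The same idea can be written as a named function and checked against the examples from the question. The sketch below assumes Python 3. The name `get_words` is illustrative. The second function, `get_words_re`, is not from this answer: it is an alternative based on the character class `[^\W_]`, which matches any word character except the underscore, so a regular expression can also do the job.

```python
import re

def get_words(text):
    """Replace every non-alphanumeric character with a space, then split."""
    return ''.join(ch if ch.isalnum() else ' ' for ch in text).split()

# Alternative (an assumption, not part of the answer above):
# [^\W_] means "not a non-word character and not an underscore",
# i.e. Unicode letters and digits only.
_WORD_RE = re.compile(r'[^\W_]+')

def get_words_re(text):
    """Return all runs of Unicode letters and digits."""
    return _WORD_RE.findall(text)

print(get_words('John eats apple_pie'))
# ['John', 'eats', 'apple', 'pie']
print(get_words_re('Andr\u00e9 eats apple-pie'))
# ['André', 'eats', 'apple', 'pie']
```

Both functions treat any character that is not a letter or digit as a separator, including non-breaking spaces, dashes, ellipses, and symbols such as ☺, so no list of delimiters has to be maintained.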

edited Jan 09 '16 at 15:18

answered Jan 09 '16 at 14:50

Remi Guan


Oh yeah, use `str.isalnum` instead if you don't think that digit or other stuff is a part of words. – Remi Guan Jan 09 '16 at 15:10
