I need a simple way to isolate words from strings in different languages. I know this is not a trivial task, but I just want to split on common punctuation like .,;:?!@#
Currently I'm using:
import re
x = "this is sparta, or not."
print re.split(r'[^-\w]', x)
['this', 'is', 'sparta', '', 'or', 'not', '']
But when I use a Cyrillic string:
x = u'правил произношение суффиксов можно иногда'
w = re.split(r'[^-\w]', x)
I get:
[u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']
How can I make a single generic splitter that solves this problem? Thank you!
EDIT: The issue above is on Python 2.7.10.
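For reference, here is a minimal sketch of the kind of splitter I am after. It assumes the empty strings come from `\w` matching only ASCII letters, digits and underscore when the `re.UNICODE` flag is not set on Python 2, so every Cyrillic character is treated as a separator. The names `WORD_SPLIT` and `split_words` are made up for illustration:

```python
# -*- coding: utf-8 -*-
import re

# Assumption: re.UNICODE makes \w follow the Unicode character database,
# so Cyrillic letters count as word characters. This matters on Python 2;
# on Python 3 it is already the default for str patterns.
WORD_SPLIT = re.compile(r'[^-\w]+', re.UNICODE)


def split_words(text):
    """Split on runs of non-word characters and drop empty strings."""
    return [word for word in WORD_SPLIT.split(text) if word]


words = split_words(u'правил произношение суффиксов, можно иногда.')
```

The `+` in the pattern collapses runs such as ", " into a single separator, and the list comprehension drops the empty string left by trailing punctuation. Hyphens and underscores stay inside words, as in the original pattern.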