Regex split fails at cyrillic string

Question

I need a simple and easy way to isolate words from strings of different languages. I know this is not a trivial task, but I just want to split on common punctuation like .,;:?!@#. Currently I'm using:

import re
x = "this is sparta, or not."
print re.split(r'[^-\w]', x)
['this', 'is', 'sparta', '', 'or', 'not', '']

But, when I use a Cyrillic string:

x =  u'правил произношение суффиксов можно иногда'
w = re.split(r'[^-\w]', x)

I get:

[u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']

How can I make a single generic splitter, which solves this problem? Thank you!

EDIT: The issue above is on Python 2.7.10.

I copied and pasted your code and cannot reproduce the issue on Python 3.6. – Haifeng Zhang Jun 08 '17 at 18:48

score 4 · Accepted Answer · answered Jun 08 '17 at 19:02

Try this:

re.split(r'\W', x, flags=re.UNICODE)

It worked for me on 2.7.13.
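
A minimal sketch of how the flag could be wrapped into a reusable splitter, assuming the question's original character class (which keeps hyphens) is still wanted. The names WORD_SPLIT and split_words are hypothetical, and the trailing + plus the filter are additions to drop the empty strings seen in the question's output:

```python
# -*- coding: utf-8 -*-
import re

# re.UNICODE makes \w match letters from any script on Python 2.
# On Python 3 this is already the default for str patterns.
# The trailing + treats a run of separators (", ") as one split point.
WORD_SPLIT = re.compile(r'[^-\w]+', flags=re.UNICODE)


def split_words(text):
    """Split unicode text on non-word characters, dropping empty pieces."""
    return [piece for piece in WORD_SPLIT.split(text) if piece]


english = split_words(u'this is sparta, or not.')
cyrillic = split_words(u'правил произношение суффиксов можно иногда')
print(len(english), len(cyrillic))
```

The input has to be a unicode string (u'...' on Python 2) for the flag to have any effect on non-ASCII letters.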

Błotosmętek

That's it! Damn flags, thank you! – Fernando Jun 08 '17 at 19:03
Will this work for Python 3 too? – Fernando Jun 08 '17 at 19:04
Yes, but it will also work without specifying the flag, as Unicode is the default for Python3. – Błotosmętek Jun 08 '17 at 19:16

score 0 · Answer 2 · answered Jun 08 '17 at 18:56

I copied and pasted your code into a Python 3 console and everything works, but when I tried it on Python 2.7 I got the same issue as you.

That's a Unicode issue.

x =  u'правил произношение суффиксов можно иногда'
myinput = raw_input(x.encode('utf8'))
w = re.split(r'[^-\w]', myinput)

Haifeng Zhang

raw_input? My console keeps waiting for me when I use this. – Fernando Jun 08 '17 at 19:02
>>> x = u'правил произношение суффиксов можно иногда' >>> myinput = raw_input(x.encode('utf8')) правил произношение суффиксов можно иногда – Haifeng Zhang Jun 08 '17 at 19:06
@HaifengZhang Your answer is a bit misleading. You're simply passing x to raw_input as a prompt. The output of raw_input is of type str with utf-8 encoded (most probably but not certainly). Your answer certainly doesn't cover how to do it on unicode strings. – Loïc Faure-Lacroix Jun 08 '17 at 20:29
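
The point raised in the last comment can be sketched as follows: when the text arrives as an encoded byte string (for example from raw_input on Python 2), decode it to unicode first and only then split. This is an illustration, not the answerer's method. The helper name split_encoded is hypothetical, and UTF-8 is an assumption about the console encoding:

```python
# -*- coding: utf-8 -*-
import re


def split_encoded(raw, encoding='utf-8'):
    """Decode a byte string to text, then split it on non-word characters."""
    # bytes -> unicode on Python 2, bytes -> str on Python 3.
    # The encoding is an assumption; a real console may use another one.
    text = raw.decode(encoding)
    pieces = re.split(r'[^-\w]+', text, flags=re.UNICODE)
    return [piece for piece in pieces if piece]


# Stands in for bytes read from the console.
raw = u'правил произношение'.encode('utf-8')
words = split_encoded(raw)
print(len(words))
```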
