0

I need a simple and easy way to isolate words from strings of different languages. I know this is not a trivial task, but I just want to split on common punctuation like .,;:?!@#. Currently I'm using:

import re

x = "this is sparta, or not."
print re.split(r'[^-\w]', x)
['this', 'is', 'sparta', '', 'or', 'not', '']

But, when I use a Cyrillic string:

x =  u'правил произношение суффиксов можно иногда'
w = re.split(r'[^-\w]', x)

I get:

[u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']

How can I make a single generic splitter, which solves this problem? Thank you!

EDIT: The issue above is on Python 2.7.10.

Fernando

2 Answers

4

Try this:

re.split(r'\W', x, flags=re.UNICODE)

It worked for me on 2.7.13.
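As a minimal sketch of how this could be wrapped up, the snippet below keeps the question's `[^-\w]` character class (so hyphenated words stay together), adds `+` so that runs of separators collapse, and drops the empty strings. The names `WORD_SEPARATOR` and `split_words` are made up for illustration. It assumes the input is already a unicode string; on Python 3 the `re.UNICODE` flag is redundant for `str` patterns but harmless.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical names, not part of the original answer.
# Any run of characters that is neither a word character nor a hyphen is a separator.
WORD_SEPARATOR = re.compile(r'[^-\w]+', re.UNICODE)


def split_words(text):
    """Return the non-empty pieces of text between separators."""
    return [word for word in WORD_SEPARATOR.split(text) if word]


print(split_words(u'this is sparta, or not.'))
print(split_words(u'правил произношение суффиксов можно иногда'))
```

Filtering out the empty strings is a choice made here, not something the question requires; drop the `if word` condition if you need to preserve them.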

Błotosmętek
0

I copied and pasted your code into a Python 3 console and everything worked, but when I tried it on Python 2.7 I got the same issue you did.

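That's a Unicode issue: in Python 2, \w matches only ASCII word characters unless the re.UNICODE flag is set, whereas in Python 3 str patterns match Unicode word characters by default.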

x =  u'правил произношение суффиксов можно иногда'
myinput = raw_input(x.encode('utf8'))
w = re.split(r'[^-\w]', myinput)
Haifeng Zhang
  • raw_input? My console keeps waiting for me when I use this. – Fernando Jun 08 '17 at 19:02
  • >>> x = u'правил произношение суффиксов можно иногда' >>> myinput = raw_input(x.encode('utf8')) правил произношение суффиксов можно иногда – Haifeng Zhang Jun 08 '17 at 19:06
  • @HaifengZhang Your answer is a bit misleading. You're simply passing x to raw_input as a prompt. The output of raw_input is of type str, most probably (but not certainly) UTF-8 encoded. Your answer certainly doesn't cover how to do it on unicode strings. – Loïc Faure-Lacroix Jun 08 '17 at 20:29