8

Once upon a time, I found this question interesting.

Today I decided to play around with the text of that book.

I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.

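#!/usr/bin/env python3.2
# coding=UTF-8

import sys, re

# Remove everything except whitespace, word characters and basic punctuation.
regexnl = re.compile(r'[^\s\w.,?!:;-]')

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    f.close()
    # Substitute on the text that was read (fs), not on the file object (f).
    rstuff = regexnl.sub('', fs)
    print(rstuff)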

Something very similar has already been done in this answer.

Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.

magnetar
  • 1
    You are calling `.close` on a `str` object (`f`), and your `print` is invalid syntax for Python 3: maybe just typos? – huon Jun 11 '12 at 13:53

3 Answers

11

This doesn't exactly answer your question, but the regex module has much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a large number of other Unicode properties). It also handles Unicode case-insensitive matching correctly.
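
As a minimal sketch of how that could look for the character class in the question: this assumes the third-party regex package is installed, and the helper name `keep_cyrillic` is made up for illustration. The try/except only keeps the snippet importable when the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # the demo at the bottom is skipped in that case


def keep_cyrillic(text):
    """Drop everything except Cyrillic characters, whitespace and basic punctuation."""
    # \p{Cyrillic} is a Unicode property class supported by regex, not by re.
    pattern = regex.compile(r"[^\p{Cyrillic}\s.,?!:;-]+")
    return pattern.sub("", text)


if regex is not None:
    print(keep_cyrillic("Vladivostok (Russian: Владивосток), Russia"))
```

Swapping \p{Cyrillic} for \P{Cyrillic} outside a negated class would select the non-Cyrillic characters instead.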

huon
  • I pip installed this, but the installation defaulted to 2.7... I need to find a way to use the module with Python3.2. Any tips on how to pip install for Python3.2? I couldn't find any documentation on that. – magnetar Jun 12 '12 at 00:33
  • 1
    Do you have the `pip-3.2` program? (As in does `pip-3.2 install regex` work?) – huon Jun 12 '12 at 01:31
  • i don't, but this is what i ended up doing anyway. i can't believe python3.2 still doesn't have sane unicode support built in. – magnetar Jun 12 '12 at 13:39
10

You can specify the Unicode range pretty easily: \u0400-\u0500 (the basic Cyrillic block is U+0400 to U+04FF). See also here.

Here's an example with some text from the Russian Wikipedia, and also a sentence from the English Wikipedia containing a single word in Cyrillic.

#coding=utf-8
import re

ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)

for x in cyril1:
    print(x)

# Separator between the two result lists, as in the output below.
print("------")

for x in cyril2:
    print(x)

output:

Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже

Addition:

Two other ways that should also work, and that are a bit less hackish than specifying a Unicode range:

  • re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
  • re.findall("\w+", text, re.UNICODE) is equivalent.

So, more specifically for your problem: re.compile('[^\s\w.,?!:;-]', re.UNICODE) should do the trick.

See here (point 7)
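
A small sketch of the difference this flag makes, assuming Python 3, where str patterns are Unicode-aware by default and re.UNICODE is therefore redundant. The sample string and variable names are invented for illustration. Compiling the same pattern with re.ASCII reproduces the symptom from the question: the Cyrillic letters disappear and only punctuation and whitespace survive.

```python
import re

# Same character class as in the question, written as a raw string.
PATTERN = r"[^\s\w.,?!:;-]"

unicode_aware = re.compile(PATTERN)        # Python 3 default for str patterns
ascii_only = re.compile(PATTERN, re.ASCII)  # \w limited to [a-zA-Z0-9_]

sample = "Привет, мир! <Hello> #42?"

print(unicode_aware.sub("", sample))  # Cyrillic and Latin words are kept
print(ascii_only.sub("", sample))     # Cyrillic words are stripped
```

In Python 2 the situation is reversed: \w is ASCII-only unless re.UNICODE (or (?u)) is given and the pattern and text are unicode strings.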

Junuxx
  • for some reason, i can't get this to work. i'm using `regexnl = re.compile('[^\u0400-\u0500\s.,?!:;-]+')` -- does that look wrong to you? – magnetar Jun 11 '12 at 20:43
  • @magnetar: You're missing the Unicode string indicator (`u'some string'` instead of `'some string'`) - try `re.compile(u'[^\u0400-\u0500\s.,?!:;-]+')` – Junuxx Jun 11 '12 at 20:55
  • Not all Cyrillic characters are in U+0400 to U+04FF. http://www.unicode.org/charts/ shows that the Cyrillic Supplement range is from U+0500..U+052F, Cyrillic Extended A is from U+2DE0 .. U+2DFF and Cyrillic Extended B is from A640..A69F. In theory, new Cyrillic characters could be added anywhere; you need to use a package that supports the Unicode script property (like dbaupp showed) to get the Cyrillic characters. – prosfilaes Jun 11 '12 at 21:16
  • @prosfilaes: See the solutions at the end of my answer. – Junuxx Jun 11 '12 at 21:18
  • @Junuxx I've tried everything you suggested, but no luck. But here's the first few hundred pages of the text itself: http://pastebin.com/ns1cQjX3 ...maybe there's still something I'm missing. – magnetar Jun 12 '12 at 00:32
  • 1
    @Junuxx: If he is using Python 3 (as he says), there's no need for `u''`, because strings are unicode by default. – Thomas K Jun 12 '12 at 11:54
  • Cyrillic != Russian. For Russian it is more precise to use `RUSSIAN_CHARS_REGEX = re.compile(u"[\u0430-\u044f\u0451\u0401\u0410-\u042f]+")`, assuming `russian_alphabet = u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"` – mrgloom Oct 11 '17 at 17:05
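
If installing an extra package is not an option, the blocks named in the comments above can be combined into one character class with the built-in re module. This is only a sketch: the names `CYRILLIC_BLOCKS`, `CYRILLIC_RUN` and `find_cyrillic` are made up here, and the list covers just the four blocks mentioned in this thread, so it would miss Cyrillic characters that live in other blocks.

```python
import re

# Block ranges as listed in the comments; not a complete script definition.
CYRILLIC_BLOCKS = (
    "\u0400-\u04FF"  # Cyrillic
    "\u0500-\u052F"  # Cyrillic Supplement
    "\u2DE0-\u2DFF"  # Cyrillic Extended-A
    "\uA640-\uA69F"  # Cyrillic Extended-B
)

CYRILLIC_RUN = re.compile("[" + CYRILLIC_BLOCKS + "]+")


def find_cyrillic(text):
    """Return every maximal run of characters from the blocks above."""
    return CYRILLIC_RUN.findall(text)


print(find_cyrillic("Vladivostok (Russian: Владивосток; IPA)"))
```
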
-2

For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic set, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".

//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463

Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
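
For the regular-expression side of the question, the code points listed above can be written as a compact character class, since 0410 through 044F form one contiguous run. This is a rough sketch for Python 3.4 or later; `RUSSIAN_SUBSET` and `is_in_subset` are hypothetical names, not part of any library.

```python
import re

# 0401, 0406, the contiguous run 0410-044F, then 0451, 0462 and 0463.
RUSSIAN_SUBSET = re.compile("[\u0401\u0406\u0410-\u044F\u0451\u0462\u0463]+")


def is_in_subset(word):
    """True if every character of word is one of the listed code points."""
    return RUSSIAN_SUBSET.fullmatch(word) is not None


print(is_in_subset("Владивосток"))  # Russian letters only
print(is_in_subset("Київ"))         # contains U+0457, which is not in the list
```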

ommunist
  • Just to assure you that my answer is an OK solution for Zorich's book. This subset covers its Cyrillic glyphs 100%. It is also a useful solution for websites that use an excessive amount of bandwidth by using full subsets from Adobe TypeKit and similar services. No need to be sorry, you just follow your own rules. I answered the question and just mentioned the additional benefit. – ommunist Nov 16 '14 at 19:36