Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re
for file in sys.argv[1:]:
f = open(file)
fs = f.read()
regexnl = re.compile('[^\s\w.,?!:;-]')
rstuff = regexnl.sub('', f)
f.close()
print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.