I'm trying to extract a word list from a Russian short story.
#!/bin/sh
export LC_ALL=ru_RU.utf8
sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq
However the tr
step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!
$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г
In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a Putty terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).
What is a portable, reliable way to lowercase unicode text in a pipe?