17

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a Putty terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).

What is a portable, reliable way to lowercase Unicode text in a pipe?

Charles
slim
  • 1
    Conversion with `sed` works for me: `echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'` – Lev Levitsky Nov 14 '12 at 15:37
  • 1
    `echo "Г" | tr [:upper:] [:lower:]` outputs "г" properly on a Mac OS X 10.8 system. – ulidtko Nov 14 '12 at 16:08
  • Thanks @LevLevitsky . That's a suitable fix for me (feel free to promote it into an answer). I wonder why tr doesn't work. – slim Nov 14 '12 at 16:08
  • @ulidtko Interesting, what version of `tr` is it? – Lev Levitsky Nov 14 '12 at 16:57
  • OSX tr is BSD tr. The manpage says that historically LC_ALL was ignored, and now it is not. I guess that implies unicode is supported. https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/tr.1.html – slim Nov 14 '12 at 17:03
  • `uname | tr "[:upper:]" "[:lower:]"` output `Linlx` on openwrt. tr is busybox 1.34.1 – martian Nov 27 '21 at 21:38
  • Per the macOS manpage: the LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of tr as described in environ(7). – ShpielMeister Jul 26 '23 at 07:11

2 Answers

13

This is what I found on Wikipedia (without any reference, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
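
The single-byte limitation matters here because each Cyrillic letter takes two bytes in UTF-8. A quick sketch to see this, assuming the text is UTF-8 encoded (`byte_count` is an illustrative helper, not a standard command):

```shell
# Hypothetical helper: count the bytes (not characters) in a string.
byte_count() { printf '%s' "$1" | wc -c | tr -d ' '; }

byte_count 'G'              # 1
byte_count 'Г'              # 2: a byte-oriented tr sees two bytes, neither an ASCII capital
printf 'Г' | od -An -tx1    # dump those two bytes in hex
```

A tr that maps one byte to one byte has no way to turn that two-byte sequence into the two-byte sequence for "г".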

Also, this is old but related.

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
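
One caveat: `[[:upper:]]*` can also match the empty string, so without the `g` flag only a match at the very start of the line gets converted. A small sketch with ASCII input (so it does not depend on the locale), assuming GNU sed, since `\L` is a GNU extension; the function names are illustrative only:

```shell
# \L is a GNU sed extension; ASCII input keeps this demo locale-independent.
lower_first() { sed 's/[[:upper:]]*/\L&/'; }    # replaces only the first match
lower_all()   { sed 's/[[:upper:]]*/\L&/g'; }   # replaces every match on the line

printf ' ABC DEF\n' | lower_first   # unchanged: the first match is the empty string before the space
printf ' ABC DEF\n' | lower_all     # every word is lowercased
```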
Lev Levitsky
  • 3
    Yes, the single-byte issue is true. I once reported this as a bug to GNU and they explained this is so by design (i.e. they would have to break compatibility with old software in order to fix it). I also discussed it on a mailing list and was similarly [told it was supposed to be that way](http://lists.gnu.org/archive/html/bug-coreutils/2004-10/msg00063.html). – Michał Kosmulski Nov 14 '12 at 18:28
  • 2
    Remember to add g flag to the regular expression, if you want to replace all occurrences. – Seppo Enarvi Jan 09 '15 at 15:22
  • 1
    If you add a space to the beginning, it will not work: `echo ' СТЭК' | sed 's/[[:upper:]]*/\L&/'` => ' СТЭК'. It seems this one works better: `echo ' СТЭК' | sed 's/.*/\L&/'` => ' стэк'. Tested on GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) – Vadik Sirekanyan Jul 07 '18 at 23:36
  • Just need to add g: 's/[[:upper:]]*/\L&/g' – Hibou57 Apr 28 '20 at 03:28
0

This works for me:

echo ЫЕРУНКЫКТ | sed -e 's/\(.*\)/\L\1/'
Daniel abzakh