tr [:upper:] [:lower:] with Cyrillic text

Question

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a PuTTY terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).

What is a portable, reliable way to lowercase Unicode text in a pipe?

Conversion with `sed` works for me: `echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'` — Lev Levitsky, Nov 14 '12 at 15:37
`echo "Г" | tr [:upper:] [:lower:]` outputs "г" properly on a Mac OS X 10.8 system. — ulidtko, Nov 14 '12 at 16:08
Thanks @LevLevitsky . That's a suitable fix for me (feel free to promote it into an answer). I wonder why tr doesn't work. — slim, Nov 14 '12 at 16:08
OS X tr is BSD tr. The man page says that LC_ALL was historically ignored and now is not. I guess that implies Unicode is supported. https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/tr.1.html — slim, Nov 14 '12 at 17:03
`uname | tr "[:upper:]" "[:lower:]"` outputs `Linlx` on OpenWrt, where tr is BusyBox 1.34.1. — martian, Nov 27 '21 at 21:38
Per the macOS man page: "The LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of tr as described in environ(7)." — ShpielMeister, Jul 26 '23 at 07:11

Lev Levitsky · Accepted Answer · 2012-11-14T16:56:39.747

This is what I found on Wikipedia (without any reference, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
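To see why a byte-oriented tr cannot perform this mapping, it helps to look at the encoding. The following is a minimal sketch, assuming the text is UTF-8 encoded; `byte_count` is a hypothetical helper name. Each Cyrillic letter here occupies two bytes, so a tool that translates one byte at a time never sees "Г" as a single character.

```shell
# byte_count: hypothetical helper that prints how many bytes its argument occupies.
byte_count() {
    # wc -c counts bytes regardless of locale; tr -d ' ' strips the padding some wc versions add.
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'Г'   # prints 2 (UTF-8 bytes 0xD0 0x93)
byte_count 'г'   # prints 2 (UTF-8 bytes 0xD0 0xB3)
byte_count 'G'   # prints 1
```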

Also, this is old but related.

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
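Applied to the pipeline from the question, the sed approach might look like the sketch below. It assumes GNU sed (`\L` and `\n` in the replacement are GNU extensions) and an active UTF-8 locale; `wordlist` is a hypothetical function name. Matching `.*` instead of `[[:upper:]]*` lowercases each whole line in a single substitution.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed and an active UTF-8 locale
# (for example LC_ALL=ru_RU.utf8, as in the question).

# wordlist: hypothetical helper; reads text on stdin, prints unique lowercase words.
# Step 1: put every whitespace-separated word on its own line.
# Step 2: strip punctuation, lowercase the line with GNU \L, drop empty lines.
# Step 3: sort and remove duplicates.
wordlist() {
    sed -E 's/[[:space:]]+/\n/g' |
    sed -e 's/[.!,—()«»;:?]//g' \
        -e 's/.*/\L&/' \
        -e '/^$/d' |
    sort -u
}

printf '%s\n' 'Стэк, СТЭК и стэк!' | wordlist
```

Under those assumptions the sample line should reduce to two words, `и` and `стэк`.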

edited Nov 14 '12 at 16:56

answered Nov 14 '12 at 16:40

Yes, the single-byte issue is true. I once reported this as a bug to GNU and they explained this is so by design (i.e. they would have to break compatibility with old software in order to fix it). I also discussed it on a mailing list and was similarly [told it was supposed to be that way](http://lists.gnu.org/archive/html/bug-coreutils/2004-10/msg00063.html). – Michał Kosmulski Nov 14 '12 at 18:28
Remember to add g flag to the regular expression, if you want to replace all occurrences. – Seppo Enarvi Jan 09 '15 at 15:22
If you add a space to the beginning, it will not work: `echo ' СТЭК' | sed 's/[[:upper:]]*/\L&/'` => ' СТЭК'. It seems this one works better: `echo ' СТЭК' | sed 's/.*/\L&/'` => ' стэк'. Tested on GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) – Vadik Sirekanyan Jul 07 '18 at 23:36
Just need to add g: 's/[[:upper:]]*/\L&/g' – Hibou57 Apr 28 '20 at 03:28

score 0 · Answer 2 · answered May 19 '21 at 10:29

This works for me:

echo ЫЕРУНКЫКТ | sed -e 's/\(.*\)/\L\1/'

Daniel abzakh
