How to make grep [A-Z] independent of locale?

Question

I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:

$ echo T | grep [A-Z]

No match.

How come T is not within A-Z range?

I changed the regex a tiny bit:

$ echo T | grep [A-Y]

A match!

Whoa! How is T within A-Y but not within A-Z?

Apparently this is because my environment is set to the Estonian locale, where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY

$ echo $LANG
et_EE.UTF-8

This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all this time? What kinds of mistakes have I made because of this in the past?

After trying several things I arrived at the following solution:

$ echo T | LANG=C grep [A-Z]

Is this the recommended way to make grep locale-independent?

Furthermore, would it be safe to define an alias like this:

$ alias grep="LANG=C grep"

PS. I'm also wondering why character ranges like [A-Z] are locale-dependent in the first place, while \w seems to be unaffected by locale (the manual says \w is equivalent to [[:alnum:]], but I found that the latter depends on locale while \w does not).

Please try all that again, but quote your expression `grep '[A-Z]'`, just to make sure the shell isn't expanding that. — Mat, Jul 23 '11 at 10:47
Works the same only because you don't have a file named A through Z. The shell tried to expand [A-Z], didn't find anything, and left it alone. Use quotes to always pass patterns to grep. — Gilbert, Jul 24 '11 at 11:10
Thanks, I didn't actually know that Bash supports expanding that kind of thing. But I've actually always been quoting the grep arguments anyway; I just thought I'd leave the quotes off to keep the code samples shorter. Now I'm a bit smarter again. — Rene Saarsoo, Jul 25 '11 at 16:33
The question "Is this the recommended way to make grep locale-independent?" isn't well-posed. `grep` has to use *some* locale; it can't operate without any locale at all. I think you might want to rephrase to ask "Is this how to make grep use the old ASCII locale I'm used to?", which does have an answer: Yes. `LC_ALL=C` will give you that old ASCII character set and collating order. `LC_COLLATE=C` will allow the full local locale character set but make sure it sorts in the familiar ASCII way. Is that what you want? — Ian D. Allen, Sep 22 '14 at 04:28
This is documented in [Character Classes and Bracket Expressions](https://linux.die.net/man/1/grep). — Vlastimil Ovčáčík, Nov 04 '18 at 11:13
tl;dr `LANG=C grep...` or `grep -P ...` are independent of locale. — Vlastimil Ovčáčík, Nov 04 '18 at 12:02

Gilbert · Answer 1 · 2011-07-23T12:33:29.087

4

POSIX regular expressions, which Linux and FreeBSD grep support naturally, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.

   grep '[[:upper:]]'

As the []s are part of the pattern name you need the outer [] as well, regardless of how strange it looks.
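As a rough illustration of the bracket syntax described above, the following sketch filters three sample lines with `[[:upper:]]`. It assumes a POSIX-style grep; `LC_ALL=C` is set here only so the result is reproducible, and the variable name `upper_lines` is made up for this example.

```shell
# Feed three one-character lines to grep and keep only those
# containing an uppercase letter. The outer [] is the bracket
# expression; the inner [:upper:] is the class name.
# LC_ALL=C is an assumption made here for a reproducible result.
upper_lines=$(printf 'T\nt\n7\n' | LC_ALL=C grep '[[:upper:]]')
echo "$upper_lines"
```

Without the `LC_ALL=C` prefix, the same pattern would consult the current locale to decide which characters count as uppercase.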

With the advent of these [:xxx:] classes, the classic \w, etc., remain strictly in the C locale. Thus your choice of pattern determines whether grep uses the current locale or not.

[A-Z] should follow the locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for you: LC_ALL takes precedence over LANG, so setting LANG alone has no effect while LC_ALL is set.
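A minimal sketch of forcing the C locale for a single grep call, assuming a POSIX shell. The helper name `cgrep` and the variable names are hypothetical, chosen only for this example; the `et_EE.UTF-8` value is taken from the question and does not need to be installed, because `LC_ALL` overrides it.

```shell
# cgrep is a hypothetical helper name, not a standard command.
# It runs grep with LC_ALL=C for that one invocation only.
cgrep() { LC_ALL=C grep "$@"; }

# With C collation, [A-Z] is the plain ASCII range, so T matches.
match=$(echo T | cgrep '[A-Z]')
echo "$match"

# LC_ALL takes precedence over LANG, so the range is still
# evaluated in the C locale even though LANG names another locale.
override=$(echo T | LANG=et_EE.UTF-8 LC_ALL=C grep '[A-Z]')
echo "$override"
```

A separate function keeps the plain `grep` command untouched, which may be preferable to an alias when locale-aware matching is occasionally wanted.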

edited Jul 23 '11 at 12:33

answered Jul 23 '11 at 11:17

Gilbert


So you say [A-Z] remains strictly in C locale? But my whole question was about it not being in C locale. – Rene Saarsoo Jul 23 '11 at 11:33
Try setting the LC_ALL environment variable rather than LANG. – Gilbert Jul 23 '11 at 12:03
LC_ALL is probably a better variable to use than LANG, as it's the one that grep checks first. But currently it doesn't make any difference for me. – Rene Saarsoo Jul 23 '11 at 12:46
Using the date command is a quick way to test if locales are enabled: LC_ALL=et_EE date – Gilbert Jul 23 '11 at 23:53
