Regular expression [A-Za-z] seems to not include letter W and w

Question

For some reason, I don't know why, maybe something isn't quite right in my system or in my brain, the regular expression "[A-Z]" doesn't seem to recognise the letter ”W” and "[a-z]" doesn't seem to recognise the letter ”w”. Example:

for x in A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v W w X x Y y Z z; do echo $x | egrep "[A-Za-z]"; done

My output is: A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v X x Y y Z z

As you can see, letters ”W” and ”w” are both missing. Am I the only one? What could possibly cause this? If it's a bug, where do I report it? This happens in bash and zsh and it happens in sed and egrep (and possibly more, I only tested those two), so the problem seems to be about regular expressions in general… :o So… what is going on??

Manjaro 17.1.12
XFCE 4.12
bash 4.4.23(1)-release (x86_64-unknown-linux-gnu)
zsh 5.5.1 (x86_64-unknown-linux-gnu)
egrep 3.1
sed 4.5

Edit: Someone asked for my locale, so here it is.

$ locale
LANG=sv_SE.utf8
LC_CTYPE="sv_SE.utf8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="sv_SE.utf8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="sv_SE.utf8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

If this is the problem, then I guess whatever decides what sv_SE.UTF-8 is, is wrong, because the letter ”w” was added to the Swedish alphabet in 2006. Also, if the A-Z interval is dependent on the current locale, shouldn't [A-Ö] work for the whole Swedish alphabet when locale is set to Swedish? It doesn't, it gives an error message. However [[:alpha:]] seems to include all Swedish letters, so I guess I'm happy with that.

What's your locale? This isn't something done by the shell, it's something done by your platform's standard C library. Anyhow, the runtime locale configuration for character categories and collation order (`LC_CTYPE`, `LC_COLLATE`, and other variables like `LC_ALL` that can override those) is relevant (no, necessary) to debug this. — Charles Duffy, Sep 29 '18 at 16:01
...if you're on a system configured for a language that doesn't include `W` in its alphabet, there you are. Please edit the output from the `locale` command into your question. — Charles Duffy, Sep 29 '18 at 16:01
BTW, `[A-Za-z]` is bad form anyhow (and "teaching resources" advising you to use it shouldn't be trusted); use `[[:upper:]]` instead of `[A-Z]` and `[[:lower:]]` in place of `[a-z]`, or `[[:alpha:]]` for both, or else collation orders that use `AaBb...Zz` instead of `A..Za..z` will mess you up. — Charles Duffy, Sep 29 '18 at 16:02
(because this is all standard-C-library functionality, it's more a general UNIX question than a question specifically about `sed` or `grep` or whatnot; they all call the same shared library code for either regex or fnmatch pattern/glob support). — Charles Duffy, Sep 29 '18 at 16:07
@Charles: It's true that [[:alpha:]] is usually better than [A-Za-z], and it's worthwhile pointing that out. But let's not lose track of the fact that there is a bug in glibc, here. The regular expression `[u-x]` should match both `v` and `w`, even in Swedish. But it doesn't. However, `w` does sort correctly and it does show up with the right ctype. — rici, Oct 01 '18 at 04:20
I agree -- this smells like a glibc bug where they failed to roll in the 2006 alphabet change. Not our circus, not our monkeys, though -- per https://sourceware.org/glibc/wiki/FilingBugs, it should be reported to the OP's Linux distro (which appears to be Manjaro), and *they* should in turn push a patch upstream to glibc proper. — Charles Duffy, Oct 01 '18 at 11:51

rici · Answer 1 · 2018-10-02T17:58:20.457

Technically speaking, using range expressions such as [a-z] in a Posix regular expression (as with the grep utility) only has specified behaviour in the Posix (C) locale. That means that you really cannot reliably use range expressions in the sv_SE locale (or any other internationalised locale). You can, however, reliably use character classes, such as [[:lower:]], [[:alpha:]], [[:alnum:]], and so on, and that is normally what you should do.
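
As a rough illustration of the two reliable options (neither is a fix for the locale itself), the sketch below forces the C locale for a range expression and, separately, uses a character class in whatever locale is currently active. The helper name `match_in_c_locale` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print each argument that matches the given bracket
# expression, with grep forced into the C locale so that a range such as
# [A-Za-z] means the plain ASCII range regardless of the user's locale.
match_in_c_locale() {
    pattern=$1
    shift
    for ch in "$@"; do
        # grep exits non-zero when nothing matches; that is expected here.
        printf '%s\n' "$ch" | LC_ALL=C grep -E "$pattern" || true
    done
}

# Option 1: range expression, but only under the C locale.
match_in_c_locale '[A-Za-z]' V v W w X x

# Option 2: character class in the current locale. This relies on the
# locale's character classification (LC_CTYPE), not on its character order.
for ch in V v W w X x; do
    printf '%s\n' "$ch" | grep -E '[[:alpha:]]'
done
```

Note that forcing `LC_ALL=C` also changes how non-ASCII input such as å, ä and ö is treated, so the character class is usually the better choice for Swedish text.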

Having said that, I believe that what you are experiencing is indeed a bug in glibc introduced in v2.28, since previous versions of the sv_SE locale correctly placed w in lower-case ranges and W in upper-case ranges. I think the change does not match user expectations, since it will break regex range expressions which previously worked as expected despite having unspecified behaviour.

The problem was reported as a glibc bug about a month ago, and almost immediately closed for lack of documentation; yesterday, I requested that it be reopened. (Update: that bug was requalified as a duplicate of another bug whose eventual solution can only be a comprehensive solution to the underlying design issue. In other words, the glibc team understand that there is a problem but don't hold your breath for a solution.)

I've put a possible replacement sv_SE locale definition file in this repository, in case it proves to be useful to someone. Please don't install it unless you are experiencing problems with the locale definition from glibc.

My excessively long comment in the bug report linked above tries to lay out the problem, which is more a problem of definition than implementation. The essential problem is that it is very difficult (if not impossible) to define a single-character collation order which is completely consistent with a whole-string comparison order. Reading between the lines in the Posix rationale document, it seems clear that a lot of people banged their heads against this particular brick wall without ever managing to come up with a practical portable proposal with implementation consensus. ("As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.")

A well-intentioned cleanup of the various locale definition files resulted in a change to the character ordering in the Swedish locale. It did not alter the string sort order, so V and W continue to be sorted as before (that is, as though they were variant spellings of the same letter rather than different letters), and it did not alter the CTYPE definitions, so W and w continue to be letters (and thus match [[:alpha:]]) as they were before. But it did (accidentally, I believe) alter the character order. Before, W followed V and w followed v, so that W matched [U-X] and w matched [u-x]. The change placed both characters after thorn (þ), which means they no longer fall inside alphabetic ranges such as [A-Z], [a-z] or [u-x].
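
The three mechanisms described above (character order, CTYPE, and string sorting) can be probed separately. This is a minimal sketch: `probe_w` is a hypothetical helper, and it assumes the locale passed to it is installed, since glibc-based tools typically fall back to C-locale behaviour when it is not.

```shell
#!/bin/sh
# Hypothetical helper: report how the letter 'w' fares under one locale
# in three independent mechanisms.
# Usage: probe_w LOCALE    (for example: probe_w sv_SE.UTF-8)
probe_w() {
    loc=$1

    # 1. Range expression: depends on the locale's character order.
    if printf 'w\n' | LC_ALL="$loc" grep -q '[u-x]'; then
        echo "range [u-x]: match"
    else
        echo "range [u-x]: no match"
    fi

    # 2. Character class: depends on the locale's CTYPE definitions.
    if printf 'w\n' | LC_ALL="$loc" grep -q '[[:alpha:]]'; then
        echo "class [[:alpha:]]: match"
    else
        echo "class [[:alpha:]]: no match"
    fi

    # 3. Whole-string sort order of the single-letter strings v, w, x.
    echo "sort order: $(printf 'x\nw\nv\n' | LC_ALL="$loc" sort | paste -s -d ' ' -)"
}

# Baseline: in the C locale all three behave as ASCII users expect.
probe_w C
```

Going by the description above, running `probe_w sv_SE.UTF-8` on an affected system would be expected to differ from the C baseline only in the first line; that output has not been verified here.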

A previous question had been suggested as a duplicate of this question, but I removed the duplicate marker because that question focuses on the wisdom of using [a-z] and not on possible implementation errors, and also because it is about Perl regexes rather than Posix regexes. However, there is a lot of useful information in the answers.

score 0 · Answer 2 · answered Sep 30 '19 at 21:25

This is NOT recommended as a "final solution" but might help someone somehow...

I found out that editing

/usr/share/i18n/locales/sv_SE

and commenting out the last two lines in this section resolved the issue.

% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'.  Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or % 'w', 'v' is
% placed before 'w'.

% &v<<<V<<w<<<W
%<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
%<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w

and after that regenerating the locale

sudo locale-gen

made things a little better...
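
For anyone who prefers not to hand-edit the system file, the same two lines could be commented out in a copy instead, so the original stays available for comparison. This is only a sketch: `comment_out_w_lines` is a made-up name, and it assumes the two lines start exactly with `<U0057> ` and `<U0077> ` as in the excerpt above, which may not hold for other glibc versions.

```shell
#!/bin/sh
# Hypothetical helper: prefix the W/w collation lines with '%', the comment
# character used in the locale source excerpt above. Reads the named file,
# or standard input if no file is given, and writes the result to stdout.
comment_out_w_lines() {
    sed -e 's/^<U0057> /%&/' -e 's/^<U0077> /%&/' "$@"
}

# Demonstration on two sample lines (no system files are touched).
printf '%s\n' \
    '<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W' \
    '<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w' |
    comment_out_w_lines
```

Redirecting the output of `comment_out_w_lines /usr/share/i18n/locales/sv_SE` to a separate file and running `diff` against the original shows exactly what changed before anything is installed.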

The comments in that file are not correct anymore. Since 2006 the letter w is a part of the Swedish alphabet and sorted between v and x, just like in English and many other languages. — user8179, Jan 12 '20 at 15:59
