Bash string lexicographical comparisons inconsistency

Question

Bash manual section 6.4 describes [[ string1 < string2 ]] as

True if string1 sorts before string2 lexicographically in the current locale.

I am using a stock English-language Linux and expected my current locale to be ASCII, where period [.] sorts lexicographically before [0-9A-Za-z]. However, take a look at these:

$ echo $BASH_VERSION
4.3.11(1)-release
$ [[ "." < "1" ]] && echo "yes"
yes
$ [[ "A" < "B" ]] && echo "yes"
yes
$ [[ ".A" < "1B" ]] && echo "yes"
$

The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?

Here is the output of locale:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

My OSX (High Sierra) has Bash 3.2.57. This "problem" started with Bash 4.1, which changed [[ < ]] to use the current locale. Your explanation of Unicode and `strcoll` seems spot on. — Michael Chen, Jun 27 '21 at 15:24
Basically, you'll want to read through the https://unicode.org/reports/tr10 standard, or study https://code.woboq.org/userspace/glibc/string/strcoll_l.c.html#do_compare and https://code.woboq.org/userspace/glibc/locale/weight.h.html#findidx . — KamilCuk, Jun 27 '21 at 16:35

oguz ismail · Answer 1 · 2021-06-27T16:29:36.337

This doesn't have much to do with your shell. To perform a locale-dependent lexicographic comparison of .A and 1B, bash simply calls strcoll(".A", "1B"), and interprets the return value, that's all.

        {
    #if defined (HAVE_STRCOLL)
          if (shell_compatibility_level > 40 && flags & TEST_LOCALE)
            return ((op[0] == '>') ? (strcoll (arg1, arg2) > 0) : (strcoll (arg1, arg2) < 0));
          else
    #endif
            return ((op[0] == '>') ? (strcmp (arg1, arg2) > 0) : (strcmp (arg1, arg2) < 0));
        }

(copied from test.c)

The above excerpt also shows that, to force a byte-by-byte comparison without altering locale settings, one needs to set the shell compatibility level to 40 (which stands for 4.0, the last version of bash that behaves the way you expected by default).

$ shopt -s compat40
$ [[ .A < 1B ]] && echo yes
yes
$
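If changing the compatibility level for the whole shell is undesirable, another option is to switch the locale for just the one comparison. The following is a minimal sketch, not something taken from the bash sources: `bytewise_lt` is a hypothetical helper name, and it assumes that the C locale collates by byte value. Its body runs in a subshell, so the caller's locale is left untouched.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds if $1 sorts before $2 byte by byte,
# whatever locale the calling shell is using.
# The ( ... ) body is a subshell, so the LC_ALL assignment stays local to it.
bytewise_lt() (
  LC_ALL=C              # the C locale compares by byte value
  [[ "$1" < "$2" ]]     # exit status of the function is the test result
)

bytewise_lt .A 1B && echo yes   # prints "yes": '.' (0x2E) precedes '1' (0x31)
```

The trade-off against `compat40` is scope: the helper affects only the comparisons routed through it, while the `shopt` setting changes every `[[ < ]]` and `[[ > ]]` in the shell.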

Now, as to your question (The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?): it's apparently your locale's collation order. Under "What Collation is NOT", the UCA specification says:

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y

Which, I think, corroborates that this is not a bug but a feature.

Then I don't understand `strcoll(".A", "1B")` returning greater than zero. The two bytes of ".A" do not form a 16-bit Chinese character or any other language. They are still ASCII values in UTF-8. — Michael Chen, Jun 27 '21 at 15:48
See [strcoll](https://www.cplusplus.com/reference/cstring/strcoll/) *"A value greater than zero indicates that the first character that does not match has a greater value in str1 than in str2;"* — Michael Chen, Jun 27 '21 at 15:53
@MichaelChen I didn't say they do. Bash is written in C, not C++; and [C standard](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf) doesn't mention a character by character comparison. In your locale, the sequence `.A` collates after `1B`. — oguz ismail, Jun 27 '21 at 16:26

score 1 · Answer 2 · answered Jun 27 '21 at 21:14

Collation in UTF-8 locales such as en_US.UTF-8 doesn't go character by character, as traditional ASCIIbetical collation does. It uses a multi-level comparison, in which some types of differences are prioritized over others even if they occur later in the string. In this case, what you're seeing is the result of "base character" order ("1B" < "A") being prioritized over a punctuation difference. Here's a quote from the standard:

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, the most important feature is the identity of the base letters—for example, the difference between an A and a B. Accent differences are typically ignored, if the base letters differ. Case differences (uppercase versus lowercase), are typically ignored, if the base letters or their accents differ. Treatment of punctuation varies. In some situations a punctuation character is treated like a base letter. In other situations, it should be ignored if there are any base, accent, or case differences. [...]

Here's an example showing the prioritization of punctuation vs "base characters":

$ printf '%s\n' {,.,-}{,1,A,AB,B,BA} | LANG=en_US.UTF-8 sort
-
.
-1
.1
1
-A
.A
A
-AB
.AB
AB
-B
.B
B
-BA
.BA
BA

Note that the punctuation only matters to break ties between lines containing the same base characters. You can also see similar effects involving capitalization and accents:

$ printf '%s\n' {a,A,B}{A,Å,B} | LANG=en_US.UTF-8 sort
aA
AA
aÅ
AÅ
aB
AB
BA
BÅ
BB

Note that the accent on the second character has higher priority than the capitalization of the first character (and punctuation anywhere in the string would have lower priority than either).

(And, of course, there are lots of other complications beyond this.)
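To make the "base characters first, punctuation as tie-breaker" idea concrete, here is a deliberately simplified two-level model in bash. It is only a sketch under stated assumptions: `toy_lt` is a hypothetical name, the real UCA and glibc algorithms use weight tables and more levels (accents, case), and this toy handles ASCII punctuation only. It forces the C locale internally so that its own comparisons are plain byte comparisons.

```bash
#!/usr/bin/env bash

# Toy two-level comparison (NOT the real UCA/glibc algorithm).
# Level 1: compare the strings with ASCII punctuation removed.
# Level 2: only if level 1 ties, compare the full strings byte by byte.
toy_lt() (
  LC_ALL=C                         # byte-wise [[ < ]] and ASCII [[:punct:]] in this subshell
  a_base=${1//[[:punct:]]/}        # "base characters" of the first string
  b_base=${2//[[:punct:]]/}        # "base characters" of the second string
  if [[ "$a_base" != "$b_base" ]]; then
    [[ "$a_base" < "$b_base" ]]    # base characters differ: they decide
  else
    [[ "$1" < "$2" ]]              # tie: punctuation breaks it
  fi
)

toy_lt .  1  && echo "'.' sorts before '1'"
toy_lt A  B  && echo "'A' sorts before 'B'"
toy_lt 1B .A && echo "'1B' sorts before '.A'"
```

In this model all three lines are printed. For `.A` versus `1B`, the level-1 keys are `A` and `1B`, and `1` precedes `A`, so `1B` sorts first even though `.` on its own precedes `1`. That matches the question's three results.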
