11

I'm working with huge files of (I hope) UTF-8 text. I can reproduce the problem on Ubuntu 13.10 (kernel 3.11.0-14-generic) and on 12.04.

While investigating a bug, I encountered strange behavior:

$ export LC_ALL=en_US.UTF-8
$ sort part-r-00000 | uniq -d 
ɥ ɨ ɞ ɧ 251
ɨ ɡ ɞ ɭ ɯ       291
ɢ ɫ ɬ ɜ 301
ɪ ɳ     475
ʈ ʂ     565

$ export LC_ALL=C
$ sort part-r-00000 | uniq -d 
$ # no duplicates found

The duplicates also appear when running a custom C++ program that reads the file using std::stringstream; it fails because of the duplicates when the en_US.UTF-8 locale is used. Plain std::string comparison and input/output themselves seem unaffected.
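For illustration, here is a hypothetical reconstruction of such a duplicate check (not my actual program; the two byte sequences are just a pair of the affected characters). It reads lines from a std::stringstream and orders them with the locale's collate facet:

#include <iostream>
#include <locale>
#include <set>
#include <sstream>
#include <string>

int main()
{
    // Hypothetical reconstruction: collect lines in a std::set
    // ordered by the locale's collation rules.
    std::locale loc("en_US.UTF-8"); // throws if the locale is not installed
    const auto& coll = std::use_facet<std::collate<char>>(loc);

    auto less = [&coll](const std::string& a, const std::string& b) {
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size()) < 0;
    };
    std::set<std::string, decltype(less)> seen(less);

    std::stringstream input("\xc9\xa2\n\xc9\xac\n"); // two distinct lines
    std::string line;
    while (std::getline(input, line))
        if (!seen.insert(line).second)
            std::cout << "duplicate: " << line << std::endl;
}

On the affected systems this reports a duplicate; a plain std::set, which compares bytewise, does not.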

Why are duplicates found when using a UTF-8 locale and no duplicates are found with the C locale?

What transformations does the locale apply to the text that cause this behavior?

Edit: Here is a small example

$ uniq -D duplicates.small.nfc 
ɢ ɦ ɟ ɧ ɹ       224
ɬ ɨ ɜ ɪ ɟ       224
ɥ ɨ ɞ ɧ 251
ɯ ɭ ɱ ɪ 251
ɨ ɡ ɞ ɭ ɯ       291
ɬ ɨ ɢ ɦ ɟ       291
ɢ ɫ ɬ ɜ 301
ɧ ɤ ɭ ɪ 301
ɹ ɣ ɫ ɬ 301
ɪ ɳ     475
ͳ ͽ     475
ʈ ʂ     565
ˈ ϡ     565

Output of locale when the problem appears:

$ locale 
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=

Edit: After normalisation using:

cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc

I still get the same results.

Edit: The file is valid UTF-8 according to iconv:

$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0

Edit: It looks like something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html

It works correctly on FreeBSD (no spurious duplicates there).

kei1aeh5quahQu4U
  • It might help if you provided a small file with the actual lines that show different behavior. Delete lines until you get exactly one pair of lines that are duplicates in the en_US locale and not in C. – Random832 Nov 26 '13 at 20:10
  • It's a 1.1 GB file. I'm looking into it. – kei1aeh5quahQu4U Nov 26 '13 at 20:19
  • Well, first, see if it happens in the first 10,000 lines or so (using the `head` command) - or maybe grep for one of the characters that appears in one of the lines output by uniq. – Random832 Nov 26 '13 at 20:33
  • the uniq output doesn't help, since it doesn't show the actual content of the two different-in-C-locale lines that it considered to be equivalent. Really, you should be able to cut things down until you have a file consisting of exactly two lines, which the en_US UTF-8 locale considers duplicates and C does not. – Random832 Nov 26 '13 at 21:04
  • For example, the first line of your duplicates list was `À Mineira Restaurante , 1098`. How large a result do you get if you do "grep Restaurante part-r-00000"? Only one example is needed, so you shouldn't need to post a huge file. – Random832 Nov 26 '13 at 21:05
  • In the case where there are unwanted duplicates, what is the output of `uniq -D | hexdump -C` (uppercase `-D` instead of `-d`)? That will print out *all* of the duplicated lines and dump them to hex to get the raw byte data. – Adam Rosenfield Nov 26 '13 at 21:16
  • Hi, sorry - fixed the problem - the difference is here: http://webis5.medien.uni-weimar.de/tmp/duplicates.small – kei1aeh5quahQu4U Nov 26 '13 at 21:20
  • https://gist.github.com/anonymous/5531aacd445f30929e36 is the result – kei1aeh5quahQu4U Nov 26 '13 at 21:23
  • @mt_: That data makes no sense. I can't think of any reason why `ɢ` (U+0262, LATIN LETTER SMALL CAPITAL G) and `ɬ` (U+026C, LATIN SMALL LETTER L WITH BELT) would ever be treated as identical. – Adam Rosenfield Nov 26 '13 at 22:18
  • I know. It's not happening on FreeBSD; it's happening on Ubuntu 10.04, 12.04 and CentOS 6.4 so far. I've just downloaded the example file each time and ran `uniq -d` with different `LC_ALL` settings. I've found some links (added in the question) that _might_ explain it, but I'm not sure. – kei1aeh5quahQu4U Nov 26 '13 at 22:22
  • @AdamRosenfield: the OP is specifying an English locale, where those letters aren't used, ergo they collate equal, ergo they are treated as equal by `uniq`. Either use an appropriate locale, or use the C or POSIX locale. – ninjalj Nov 27 '13 at 00:20
  • @ninjalj: But this is totally unexpected. If you process UTF-8 in C++ and the locale is used for `std::string` comparison, you just get wrong results. – kei1aeh5quahQu4U Nov 27 '13 at 00:25
  • @mt_: Well, which results did you expect? Should _æ_ and _ae_ be treated as equals? _ß_ and _SS_? _ü_ and _ue_? The moment you specify a locale you select a very specific (and very limited, to a very local culture) behavior. – ninjalj Nov 27 '13 at 00:29
  • @ninjalj: So in order to be safe I should resort to set `C` internally to be sure I compare always the bytes? – kei1aeh5quahQu4U Nov 27 '13 at 00:37
  • Since your example has many languages, and there is no metadata telling which language each line is in, that is probably your best bet. Note that if you had, e.g., `äußerte` and (I think) `äusserte`, they should theoretically be treated as equal, and for Spanish you should typically ignore accents when sorting, etc. – ninjalj Nov 27 '13 at 00:51
  • Thanks. I just discovered that the problem is unrelated to C++. I don't care about `uniq` that much. – kei1aeh5quahQu4U Nov 27 '13 at 00:55
  • @ninjalj: The answers to "Should `æ` and `ae` be treated as equals? `ß` and `SS`? `ü` and `ue`?" obviously depend on the current locale. But what about `ß` and `א`? `θ` and `カ`? `∫` and ``? There is no locale I know of that has any of these pairs as equivalents for sorting. It doesn't really matter whether one sorts before the other, as long as the ordering is well-defined and consistent. There's no good reason why they should sort as equal to each other in any locale. Likewise for `ɢ` and `ɬ`. – Adam Rosenfield Nov 27 '13 at 15:56

3 Answers

7

I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. To recap, my minimal example demonstrating the different behaviour of uniq depending on the current locale was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and investigated both call graphs with kcachegrind.

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example for strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    const char* s1 = "\xc9\xa2"; // ɢ (U+0262) as UTF-8
    const char* s2 = "\xc9\xac"; // ɬ (U+026C) as UTF-8
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1

�
�
0
-1
-10
-1

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?
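For further digging: strcoll() is specified to be consistent with strxfrm(), i.e. comparing two strings with strcoll() gives the same ordering as strcmp() on their strxfrm() sort keys. A small exploratory sketch (illustrative; the exact key bytes vary by libc and locale) to dump those keys:

#include <clocale>
#include <cstdio>
#include <cstring>

// Print the collation sort key strxfrm() produces for s under the
// current LC_COLLATE setting.
static void dump_key(const char* s)
{
    char key[256];
    std::size_t n = std::strxfrm(key, s, sizeof key);
    if (n >= sizeof key) { std::puts("key buffer too small"); return; }
    std::printf("key (%zu bytes):", n);
    for (std::size_t i = 0; i < n; ++i)
        std::printf(" %02x", static_cast<unsigned char>(key[i]));
    std::printf("\n");
}

int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    dump_key("\xc9\xa2"); // ɢ
    dump_key("\xc9\xac"); // ɬ
}

If both keys come out byte-identical, the two characters carry no distinguishing weight in this locale's collation, which would explain why strcoll() returns 0.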

  • `strcoll()` doesn't just compare characters: it compares characters for sorting. Since there is no sorting (collation) order defined in English for the characters you used, they `strcoll()` in the same place (i.e. they are in the same place in the order formed by the `strcoll()` function). – ninjalj Nov 27 '13 at 10:09
  • @ninjalj: I think this answers the whole question. It makes sense to me to not define an order for characters that are not part of the own/used language. The specification can be found here [Unicode Collation Algorithm](http://www.unicode.org/reports/tr10/). – Martin Trenkmann Nov 27 '13 at 12:08
  • In your last example, `"\xa2"` and `"\xac"` are not valid UTF-8 strings; it makes no sense to collate them under a UTF-8-encoded locale. Just a nitpick. – n. m. could be an AI Nov 27 '13 at 14:56
  • For me this is a bug in `uniq`. Unknown and invalid strings, as well as characters for which no collation is defined, should be compared bytewise. – kei1aeh5quahQu4U Nov 27 '13 at 15:55
  • Why, then, is the behavior different on FreeBSD and Mac OS X? – kei1aeh5quahQu4U Nov 27 '13 at 15:56
  • @mt_ POSIX mandates that "Comparisons \[...\] shall be performed using the collating sequence of the current locale". No ifs or buts. – n. m. could be an AI Nov 27 '13 at 18:56
  • @n.m.: You are right. But I suspect the algorithm is not correctly implemented: see http://www.unicode.org/reports/tr10/#Main_Algorithm Step S2.2 - I suspect the characters here are not in the table. They should still get different weights according to 7.1.2 in the same document. – kei1aeh5quahQu4U Nov 27 '13 at 19:19
3

It could be due to Unicode normalization. There are sequences of code points in Unicode which are distinct and yet are considered equivalent.

One simple example of that is combining characters. Many accented characters like "é" can be represented either as a single code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE) or as a combination of an unaccented character and a combining character, e.g. the two-character sequence <U+0065, U+0301> (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).

Those two byte sequences are obviously different, and so in the C locale, they compare as different. But in a UTF-8 locale, they're treated as identical due to Unicode normalization.

Here's a simple two-line file with this example:

$ echo -e '\xc3\xa9\ne\xcc\x81' > test.txt
$ cat test.txt
é
é
$ hexdump -C test.txt
00000000  c3 a9 0a 65 cc 81 0a                              |...e...|
00000007
$ LC_ALL=C uniq -d test.txt  # No output
$ LC_ALL=en_US.UTF-8 uniq -d test.txt
é
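
If you want to check directly whether a given system's collation treats the two forms as equal, without going through sort and uniq, here is a minimal strcoll() sketch (assuming the en_US.UTF-8 locale is installed; as the edit below notes, not all systems normalize):

#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    const char* nfc = "\xc3\xa9";  // U+00E9, precomposed é
    const char* nfd = "e\xcc\x81"; // U+0065 U+0301, decomposed é
    // 0 means this locale's collation treats both forms as equal.
    std::cout << std::strcoll(nfc, nfd) << std::endl;
}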

Edit by n.m.: Not all Linux systems do Unicode normalization.

n. m. could be an AI
Adam Rosenfield
  • Not sure if I made a mistake but this does not seem to work here: I've normalized the data using: `cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc` and still get the same results for `sort | uniq -d` depending on the locale – kei1aeh5quahQu4U Nov 26 '13 at 20:52
  • @mt_: Hmm, then something else is going on. FWIW I'm not able to reproduce the problem with the test data you posted. No matter what `LC_COLLATE` and `LC_ALL` are, the output of `uniq -d` is always empty for me. – Adam Rosenfield Nov 26 '13 at 21:06
  • Wow. The problem was on a Ubuntu 12.04 machine. I've just downloaded the file and can reproduce the duplicates problem on a Ubuntu 13.10 machine. – kei1aeh5quahQu4U Nov 26 '13 at 21:13
  • @AdamRosenfield: `LC_ALL=en_US.UTF-8 uniq -d test.txt` outputs nothing on my machine, what gives? – n. m. could be an AI Nov 26 '13 at 21:23
  • I just tested it on a CentOS 6.4 machine and got the same result. The bytes are also different in the example file. I've updated the file; it's now only a small file. – kei1aeh5quahQu4U Nov 26 '13 at 21:49
  • @n.m.: Your system probably doesn't perform Unicode normalization when comparing strings in a UTF-8 locale. This would be performed inside the C standard library (e.g. glibc) inside the [`strcoll(3)`](http://linux.die.net/man/3/strcoll) function. – Adam Rosenfield Nov 26 '13 at 22:23
  • @AdamRosenfield it indeed does not, the question is why should it (where is this documented). – n. m. could be an AI Nov 27 '13 at 03:51
  • @AdamRosenfield which system are you using? Is it a Mac OS perhaps? – n. m. could be an AI Nov 27 '13 at 04:32
  • @AdamRosenfield I have edited your answer, mainly because I have accidentally downvoted it and could not revert the downvote unless an edit was done, but also to add that Unicode normalization is not universally done (I use two Linux systems, neither does it). Feel free to revert. – n. m. could be an AI Nov 27 '13 at 06:55
0

Purely conjecture at this point, since we can't see the actual data, but I would guess something like this is going on.

UTF-8 encodes code points 0-127 as their representative byte value; values above that take two or more bytes. There is a canonical definition of which ranges of values use a certain number of bytes, and of the format of those bytes. However, a code point could in principle be encoded in several ways. For example, 32 (the ASCII space) could be encoded as 0x20 (its canonical encoding), but it could also be written as the two-byte sequence 0xC0 0xA0, a so-called overlong encoding. That violates a strict interpretation of the encoding, so a well-formed UTF-8 writing application would never encode it that way.

However, decoders are generally written to be more forgiving, to deal with faulty encodings. So the UTF-8 decoder in your particular situation might be seeing a sequence that isn't a strictly conforming encoded code point and interpreting it in the most reasonable way it can, which would cause it to treat certain multi-byte sequences as equivalent to others. Locale collating sequences would then have a further effect.

In the C locale, 0x20 would certainly sort before 0xC0; but in a UTF-8 locale, if the decoder grabs a following 0xA0, that single byte would be considered equal to the two-byte sequence, and the lines would sort together.
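
To make the conjecture concrete, here is a small sketch that detects the overlong two-byte form described above (illustrative only; the file in question was later verified as valid UTF-8 with iconv, so overlong sequences are not what actually happened here):

#include <cstdio>

// A two-byte UTF-8 sequence encodes code points U+0080..U+07FF only,
// so lead bytes 0xC0 and 0xC1 can only start an overlong encoding.
static bool starts_overlong_2byte(const unsigned char* p)
{
    return (p[0] == 0xC0 || p[0] == 0xC1) && (p[1] & 0xC0) == 0x80;
}

int main()
{
    const unsigned char overlong_space[] = { 0xC0, 0xA0 }; // overlong U+0020
    const unsigned char valid_e_acute[]  = { 0xC3, 0xA9 }; // é, well-formed
    std::printf("%d %d\n",
                starts_overlong_2byte(overlong_space),
                starts_overlong_2byte(valid_e_acute)); // prints "1 0"
}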

twalberg
  • I've added the data to the post. I've put the file through `uconv` and did an NFC normalisation, but the issue remains. I'll check the validity of the UTF-8 next. – kei1aeh5quahQu4U Nov 26 '13 at 21:02