I'm working with huge files of (I hope) UTF-8 text. I can reproduce the problem on Ubuntu 13.10 (3.11.0-14-generic) and 12.04. While investigating a bug, I encountered this strange behaviour:
$ export LC_ALL=en_US.UTF-8
$ sort part-r-00000 | uniq -d
ɥ ɨ ɞ ɧ 251
ɨ ɡ ɞ ɭ ɯ 291
ɢ ɫ ɬ ɜ 301
ɪ ɳ 475
ʈ ʂ 565
$ export LC_ALL=C
$ sort part-r-00000 | uniq -d
$ # no duplicates found
The duplicates also appear when running a custom C++ program that reads the file using std::string and input/output streams: it fails due to duplicates when using the en_US.UTF-8 locale. C++ seems to be unaffected, at least for std::stringstream with the default locale.
Why are duplicates found with a UTF-8 locale, but none with the C locale?
What transformations does the locale apply to the text that cause this behaviour?
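The difference between the two locales can be reproduced directly with locale-aware string collation, which is effectively what sort and uniq use (via strcoll(3)/strxfrm(3)). A minimal Python sketch, using two of the 475-keyed lines from the output above as hard-coded samples:

```python
import locale

# In the C locale, POSIX guarantees collation is plain byte/code-point
# comparison, so two distinct strings never compare equal.
locale.setlocale(locale.LC_COLLATE, "C")
a = "\u026a \u0273 475"   # "ɪ ɳ 475"
b = "\u0373 \u037d 475"   # "ͳ ͽ 475"
print(locale.strcoll(a, b))   # nonzero: the strings are distinct

# Under a UTF-8 locale, comparison goes through the locale's collation
# tables instead. On some glibc versions these strings collate as equal,
# which is exactly what makes uniq report them as duplicates.
# (en_US.UTF-8 may not be installed everywhere, hence the try/except.)
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(locale.strcoll(a, b))  # 0 on affected glibc versions
except locale.Error:
    print("en_US.UTF-8 locale not available on this system")
```

If the second call prints 0, the locale's collation tables, not the file contents, are the source of the "duplicates".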
Edit: Here is a small example
$ uniq -D duplicates.small.nfc
ɢ ɦ ɟ ɧ ɹ 224
ɬ ɨ ɜ ɪ ɟ 224
ɥ ɨ ɞ ɧ 251
ɯ ɭ ɱ ɪ 251
ɨ ɡ ɞ ɭ ɯ 291
ɬ ɨ ɢ ɦ ɟ 291
ɢ ɫ ɬ ɜ 301
ɧ ɤ ɭ ɪ 301
ɹ ɣ ɫ ɬ 301
ɪ ɳ 475
ͳ ͽ 475
ʈ ʂ 565
ˈ ϡ 565
Output of locale when the problem appears:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=
Edit: After normalisation using:
cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc
I still get the same results.
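That normalisation makes no difference can also be checked in isolation. A small Python sketch, again using two of the reported "duplicate" lines as hard-coded samples: NFC (what `uconv -x nfc` applies) leaves both strings unchanged, so the byte sequences stay distinct and normalisation cannot be what makes them compare equal:

```python
import unicodedata

a = "\u026a \u0273 475"   # "ɪ ɳ 475"
b = "\u0373 \u037d 475"   # "ͳ ͽ 475"

# Neither string has a canonical decomposition, so NFC is a no-op here.
assert unicodedata.normalize("NFC", a) == a
assert unicodedata.normalize("NFC", b) == b

# The lines remain distinct strings after normalisation.
assert a != b
print("NFC leaves both lines unchanged and distinct")
```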
Edit: The file is valid UTF-8 according to iconv (from here):
$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0
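The same validity check can be sketched without iconv: a strict UTF-8 decode raises an error on any invalid byte sequence. The byte literal below is a stand-in for the file contents, not the real data:

```python
# Strict decoding is equivalent to iconv's validity check: any invalid
# UTF-8 byte sequence raises UnicodeDecodeError.
data = b"\xc9\xaa \xc9\xb3 475\n"   # sample bytes standing in for the file
text = data.decode("utf-8")         # would raise if the bytes were invalid
print("valid UTF-8")
```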
Edit: Looks like it's something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html
It works correctly on FreeBSD.