Similar to the question "How to sort words with accents?", I am trying to sort French words in a file from the shell, on macOS Monterey with LANG=en_US.UTF-8 and neither LC_ALL nor LC_COLLATE set.

$ echo $'Bénéficiaires\néboueur\nComptabilité' > sample.txt
$ LC_ALL=C sort -fd  sample.txt
Bénéficiaires
éboueur
Comptabilité

So sort treats "é" as if the character were not there. Is there a way to fix this?

If I try sorting without LC_ALL=C, I get:

$ sort -fd  sample.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘\303BOUEUR’ and ‘COMPTABILIT\303’.
tkruse

1 Answer

Bash has no control over how sort works internally.

You seem to be confused about your locales. LC_ALL=C specifically overrides the collation order to ignore your locale's sorting conventions.
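
To check which collation locale a command will actually see, the `locale` utility prints the effective value of each category. The following is a minimal sketch, assuming a POSIX-style `locale` command is installed; the variable name `effective` is only for illustration.

```shell
# Print the collation category as seen by a command run with LC_ALL=C.
# LC_ALL takes precedence over LANG and over the individual LC_* variables,
# so the LANG value given here has no effect on collation.
effective=$(LANG=en_US.UTF-8 LC_ALL=C locale | grep '^LC_COLLATE')
echo "$effective"
```

The line printed should name the C locale for LC_COLLATE, whatever LANG is set to.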

tripleee$ LANG=fr_FR.UTF-8 sort -fd sample.txt 
bâtiment débutant
bénéficiaires
bricomarché
comptabilité
contrôle
éboueur
économie

The LANG environment variable is technically fine here, though perhaps you want to set LC_ALL=fr_FR.UTF-8 if that is your permanent locale. If you only want to affect the collation order, perhaps temporarily, that's LC_COLLATE.
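
As a rough sketch of the temporary LC_COLLATE override, assuming a French UTF-8 locale is installed under the name fr_FR.UTF-8 (the spelling differs between systems, and the variable `sorted` is only for illustration):

```shell
# Sample data mirroring the file from the question.
printf 'Bénéficiaires\néboueur\nComptabilité\n' > sample.txt

# List installed French locales; the exact name varies by system
# (for example fr_FR.UTF-8 on macOS, fr_FR.utf8 on many Linux systems).
locale -a | grep -i '^fr_FR' || echo "no fr_FR locale installed"

# Override only the collation category, and only for this one command.
# An exported LC_ALL would still win, so it is cleared inside the subshell.
sorted=$(unset LC_ALL; LC_COLLATE=fr_FR.UTF-8 sort -fd sample.txt)
printf '%s\n' "$sorted"
```

If the named locale is not installed, many sort implementations silently fall back to the C locale, so it is worth checking the `locale -a` output first.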

The "Illegal byte sequence" error suggests that your text isn't actually UTF-8, so perhaps the proper fix is to change that. (Hint: iconv; but you obviously have to know or guess which encoding to translate from. See the Stack Overflow character-encoding tag info page for details.) For comparison, here is how the bytes differ when the sample is converted to ISO-8859-1:

tripleee$ iconv -f utf-8 -t iso-8859-1 sample.txt >broken.txt
tripleee$  diff -u <(xxd sample.txt) <(xxd broken.txt) | grep 00000020
-00000020: 0a62 c3a9 6ec3 a966 6963 6961 6972 6573  .b..n..ficiaires
+00000020: 6ee9 6669 6369 6169 7265 730a 636f 6d70  n.ficiaires.comp
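
One way to test whether a file really is UTF-8 is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is only a sketch: the helper name `is_utf8` and the file names are made up for illustration, and the repair step assumes the broken file is ISO-8859-1, which you would have to know or guess for real data.

```shell
# Build a valid UTF-8 sample and a Latin-1 copy of it.
printf 'Bénéficiaires\néboueur\nComptabilité\n' > sample.txt
iconv -f UTF-8 -t ISO-8859-1 sample.txt > broken.txt

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 sample.txt; then echo "sample.txt: valid UTF-8"; fi
if ! is_utf8 broken.txt; then echo "broken.txt: not valid UTF-8"; fi

# Repair the copy, assuming its real encoding is ISO-8859-1.
iconv -f ISO-8859-1 -t UTF-8 broken.txt > fixed.txt
cmp sample.txt fixed.txt && echo "round trip OK"
```

If the check passes on your file, the encoding is probably not the cause and the locale setup is the next thing to look at.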
tripleee
  • Thanks for investigating. I am sure the file is in UTF-8. However, I found by accident that `sort -n` works fine on the same file on which `sort -d` fails. Even `echo $'français\ndiplôme\néboueur' | sort -d` fails with the illegal byte sequence error, but works with `sort -n`. – tkruse Feb 10 '22 at 08:42
  • Even though your answer is different, I'd be happy to accept it if you can include `sort -n` as a solution. I will ask a separate question about `sort -d`. – tkruse Feb 10 '22 at 08:44
  • There is nothing in your question about `sort -n`; how is it even relevant here? – tripleee Feb 10 '22 at 09:05
  • I cannot repro the "illegal byte sequence" problem anyway. Could you edit your question to provide a minimal reproducible example, probably with more information about how exactly your locale is set up? – tripleee Feb 10 '22 at 09:06
  • I hope it's now easier to reproduce; I updated the description. I have also created https://stackoverflow.com/questions/71062213, and the two questions now seem to be almost duplicates. – tkruse Feb 10 '22 at 09:35