
I was wondering how to sort a list of Spanish words (with accents) alphabetically.

Excerpt from the word list:

Chocó
Cundinamarca
Córdoba
Jonathan Leffler
Another.Chemist
  • What happens if you just use `sort`? What is your OS? – n. m. could be an AI Jul 28 '14 at 02:57
  • It prints as shown in the post; I'm on Cygwin – Another.Chemist Jul 28 '14 at 02:59
  • Make sure you've installed (and selected) a Spanish locale – rici Jul 28 '14 at 03:00
  • Good point, how can I do it? – Another.Chemist Jul 28 '14 at 03:04
  • try `LANG=es_ES.utf8 sort your-file.txt` – n. m. could be an AI Jul 28 '14 at 03:04
  • Thanks @n.m., it works!!! Next time, how can I use the other options of bash shell command lines like _LANG=es_ES.utf8_? – Another.Chemist Jul 28 '14 at 03:06
  • The notation @n.m. suggested temporarily exports the environment variable LANG with the `es_ES.utf8` setting. I'm not sure quite what you're after, but setting that in the environment permanently (rather than just for the one `sort` command) will mean all tools (that pay attention to locale) will use the value automatically. – Jonathan Leffler Jul 28 '14 at 03:30
  • Thanks so much @JonathanLeffler, last question: How can I know which commands pay attention to locale? (i.e. I only know that awk is quite nutty with accents) – Another.Chemist Jul 28 '14 at 03:36
  • I'm not sure whether there's a good way other than the infamous 'trial and error' mechanism. You can look at the manual pages, but they may not tell you anything useful. You might get hold of the source to look at it. Basically, it depends on whether the command does a `setlocale("");` at startup; by default, the system does the equivalent of `setlocale("C");` at startup (usually not explicitly), which would be analogous to using `LANG=en_US.utf8`. – Jonathan Leffler Jul 28 '14 at 03:39
  • What does `echo $LANG` show? If you don't have LANG set, your bash is not reading /etc/profile. You are probably not running it as `bash -l`. If this is the case, and you want it this way, you may want to selectively enable some stuff from /etc/profile.d, e.g. lang.sh, but I'm drifting off topic here. In short, make sure LANG=....utf8 is set. If you want e.g. English UI but Spanish sort rules, set `LANG=en_US.utf8` and `LC_COLLATE=es_ES.utf8`. – n. m. could be an AI Jul 28 '14 at 04:03
  • @JonathanLeffler: Good advice in general, but the "C" locale is a single-byte encoding that recognizes letters only in the 7-bit ASCII range - unlike the (multi-byte-on-demand) `en_US.UTF-8` locale, which, as the name suggests, is Unicode-aware. In fact, the sample input sorts correctly with locale `en_US.UTF-8`. – mklement0 Jul 28 '14 at 04:17
  • @mklement0: some code tries, and some code doesn't try, to understand locales. For example, `more` on Mac OS X has difficulties with UTF-8. For example, from a change log entry: `2013-09-08: Fix compilation warnings under stringent compilation options, reported by yvind .` where the `` is how `more` chooses to represent `Ø` (the gentleman comes from Norway; his name is Øyvind, and Ø is U+00D8 in Unicode). This despite having LANG=en_US.UTF-8 in the environment. One of the reasons for saying "trial and error" is precisely that some programs do not handle locales well. – Jonathan Leffler Jul 28 '14 at 04:25
  • @JonathanLeffler: Understood, and agreed (even more disconcertingly, `sort` on OSX is not locale-aware). What I was responding to is your statement `the system does the equivalent of setlocale("C"); at startup (usually not explicitly), which would be analogous to using LANG=en_US.utf8`, which I interpreted to mean that you're claiming that locale "C" and "en_US.UTF-8" are equivalent, which is _not_ true. Perhaps I misunderstood your intent. – mklement0 Jul 28 '14 at 04:29
  • @JonathanLeffler That particular changelog entry is probably because he submitted his name in Latin-1 or a similar encoding, which is incorrect UTF-8. Somebody who is unfamiliar with the issues would perhaps just copy/paste what's on the screen. – tripleee Jul 28 '14 at 04:50
  • @mklement0: Oh, I see what you mean. Hmmm…I think I misspoke; thank you for calling me on it. It's a tricky area, at best. The misspoken part is 'analogous to using `LANG=en_US.utf8`'; I believe the rest of what I said is valid. That phrase should perhaps be 'analogous to using a US English locale and an appropriate character set' (leaving it to be determined what is an appropriate character set). – Jonathan Leffler Jul 28 '14 at 04:54
  • @tripleee: ah, I see; `vim` is misleading me because it is showing ø in the file, even though as you suspected, there is no leading 0xC3 byte that would be needed for the character to be in UTF-8 (U+00D8 is represented by 0xC3 0x98 in UTF-8, as you know, but this is for the benefit of those who come later, if anyone does). But when I type the keyboard sequence to enter ø in UTF-8, `vim` also displays that as ø (and saves it as ISO 8859-1 rather than UTF-8)! Oh well, it is what happens when you don't check the data before posting. – Jonathan Leffler Jul 28 '14 at 05:02
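
To illustrate the point made in the comments about `LANG=es_ES.utf8 sort your-file.txt` setting the variable only for that one command, here is a minimal sketch. `DEMO_VAR` is a made-up variable used purely for illustration; `LANG` and `LC_COLLATE` should behave the same way when placed in front of a command.

```shell
# DEMO_VAR is a hypothetical variable standing in for LANG / LC_COLLATE.
unset DEMO_VAR

# A prefix assignment is placed in the environment of that one command only.
inside=$(DEMO_VAR=temporary sh -c 'printf %s "${DEMO_VAR-unset}"')

# Back in the calling shell, the variable was never set.
after=${DEMO_VAR-unset}

printf 'inside the command: %s\n' "$inside"
printf 'back in the shell:  %s\n' "$after"

# export makes the value visible to every command started afterwards.
export DEMO_VAR=persistent
exported=$(sh -c 'printf %s "${DEMO_VAR-unset}"')
printf 'after export:       %s\n' "$exported"
```

So a one-off `LANG=... sort file` leaves the rest of the session untouched, whereas an `export` (for instance in a shell startup file) affects every locale-aware tool started from that shell.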

1 Answer


Cygwin uses GNU utilities, which are usually well-behaved when it comes to locales - a notable and regrettable exception is awk (gawk) [ref].

The following is based on Cygwin 1.7.31-3, current as of this writing.

  • Cygwin by default uses the locale implied by the current Windows user's UI language, combined with UTF-8 character encoding.
    • Note that it's NOT based on the setting for date/time/number/currency formats, and changing that makes no difference. The limitation of basing the locale on the UI language is that it invariably uses that language's "home" region; e.g., if your UI language is Spanish, Cygwin will invariably use es_ES, i.e., Spain's locale. The only way to change that is to explicitly override the default - see below.
  • You can override this in a variety of ways, preferably by defining a persistent Windows environment variable named LANG (see below; for an overview of all methods, see https://superuser.com/a/271423/139307).

To see what locale is in effect in Cygwin, run `locale` and inspect the value of the LANG variable.

If that doesn't show es_*.utf8 (where * represents your region in the Spanish-speaking world, e.g., CO for Colombia, ES for Spain, ...), set the locale as follows:

  • In Windows, open the Start menu and search for 'environment', then select Edit environment variables for your account, which opens the Environment Variables dialog.
  • Edit or create a variable named LANG with the desired locale, e.g., es_CO.utf8 -- UTF-8 character encoding is usually the best choice.

Any Cygwin bash shell you open from then on should reflect the new locale - verify by running `locale` and ensuring that the LC_* values match the LANG value and that no warnings are reported.
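
If you'd rather not eyeball the `locale` output, the check could be scripted roughly as follows. Both helper functions are hypothetical names invented for this sketch, and it assumes `locale` prints lines of the form `LC_CTYPE="es_CO.utf8"`, which is the usual format but worth confirming on your system.

```shell
# Hypothetical helper: succeeds if the name looks like es_<REGION> with UTF-8.
is_spanish_utf8() {
  case $1 in
    es_??.utf8|es_??.UTF-8|es_??.utf-8|es_??.UTF8) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: reads `locale` output on stdin and prints the name of
# every LC_* category whose value differs from the expected locale.
mismatched_categories() {
  expected=$1
  while IFS='=' read -r name value; do
    case $name in
      LC_ALL) continue ;;          # normally empty unless set explicitly
      LC_*)
        value=${value#\"}          # strip the surrounding double quotes
        value=${value%\"}
        [ "$value" = "$expected" ] || printf '%s\n' "$name"
        ;;
    esac
  done
}
```

Usage would be something like `locale | mismatched_categories "$LANG"`, which prints nothing when all categories agree, and `is_spanish_utf8 "$LANG" || echo "LANG is not a Spanish UTF-8 locale"`.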

At that point, the following:

sort <<<$'Chocó\nCundinamarca\nCórdoba'

should produce (i.e., ó will sort directly after o, as desired):

Chocó
Córdoba
Cundinamarca

Note: locale en_US.utf8 would produce the same output - apparently, it generically sorts accented characters directly after their base characters - which may or may not be what a specific non-US locale actually does.
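
For contrast, here is a sketch of what the "C" locale does with the same input. It sorts by raw byte value, which is why the word list in the question came out in the "wrong" order. The `es_ES` lookup is guarded because that locale may not be installed on your system; the variable names are just for illustration.

```shell
words='Chocó
Cundinamarca
Córdoba'

# The C locale compares raw bytes, so ó (0xC3 0xB3 in UTF-8) sorts after
# every ASCII letter: Chocó, Cundinamarca, Córdoba.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf 'C locale:\n%s\n' "$c_order"

# Use a Spanish UTF-8 locale only if `locale -a` reports one.
es_locale=$(locale -a 2>/dev/null | grep -i '^es_ES\.utf-\{0,1\}8$' | head -n 1)
if [ -n "$es_locale" ]; then
  es_order=$(printf '%s\n' "$words" | LC_ALL="$es_locale" sort)
  printf '%s:\n%s\n' "$es_locale" "$es_order"
else
  printf 'No es_ES UTF-8 locale installed; skipping the comparison.\n'
fi
```

Note that `LC_ALL` overrides both `LANG` and the individual `LC_*` variables, which makes it handy for one-off experiments like this one.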

mklement0