I have some complicated code, but instead of showing you that, I am going to extract the essence of the problem.

Evaluate: "dogs" < "cats" … This should evaluate to FALSE, and it does in R 3.6.0.

Evaluate: "Dogs" < "cats" … This should evaluate to TRUE because the ASCII code for "D" is 68 and the ASCII code for "c" is 99. Since 68 < 99, "Dogs" < "cats" should evaluate to TRUE, but it does not in R 3.6.0. However, when I tried using the Console window on the https://datacamp.com website, the expression "Dogs" < "cats" returned TRUE and the expression "dogs" < "Cats" returned FALSE - as expected.

Hence my question: why does R 3.6.0 return FALSE for "Dogs" < "cats"?

Ritchie Sacramento
  • 2
    Possible duplicate of [Why does "one" < 2 equal FALSE in R?](https://stackoverflow.com/questions/27005295/why-does-one-2-equal-false-in-r) – Shree Jun 06 '19 at 22:25
  • 2
    From help file on comparison: "Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising." – qwr Jun 06 '19 at 22:29
  • Possible duplicate of [arrange() putting capital letters first](https://stackoverflow.com/questions/32640551/arrange-putting-capital-letters-first) – qwr Jun 06 '19 at 22:30
  • 1
    So I would guess that your locale differs to that used by the Datacamp console. See what `Sys.getlocale()` returns in each case. – neilfws Jun 06 '19 at 22:32

1 Answer

The interpreter at DataCamp shows:

> Sys.getlocale()
[1] "C"

whereas mine, and probably yours, shows:

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

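With the "C" locale, strings are compared by the ASCII values of their characters, so every uppercase letter sorts before every lowercase letter. With en_US.UTF-8, the collation order goes aAbBcC and so on: letters are ordered alphabetically first, and case only breaks ties. Under en_US.UTF-8, "cats" therefore sorts before "Dogs", which is why "Dogs" < "cats" returns FALSE on your machine.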
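To check this in your own session, here is a minimal sketch that temporarily switches only the collation category to "C" and then restores it. The variable names (`old_collate`, `c_upper_first`, `c_lower_first`) are just illustrative, and it assumes your platform lets `Sys.setlocale()` change `LC_COLLATE` at run time.

```r
# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch only the collation category to "C" (comparison by byte value).
invisible(Sys.setlocale("LC_COLLATE", "C"))

c_upper_first <- "Dogs" < "cats"   # "D" is 68, "c" is 99
c_lower_first <- "dogs" < "Cats"   # "d" is 100, "C" is 67

# Restore the original collation order.
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(c_upper_first)  # TRUE
print(c_lower_first)  # FALSE
```

Changing the locale affects every later string comparison and sort in the session, so restoring the old value afterwards is usually the safer habit.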

As mentioned in the comments, this is explained further in the documentation for relational operators:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
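If the goal is an ordering that does not depend on the session locale at all, one option (a sketch, not the only approach) is `sort()` with `method = "radix"`, which orders character vectors as in the C locale. The example vector `x` is made up for illustration; `utf8ToInt()` is used only to display the code points of the first letters.

```r
x <- c("dogs", "Dogs", "cats", "Cats")

# method = "radix" sorts character vectors in C-locale order,
# regardless of the session's LC_COLLATE setting.
sorted_c <- sort(x, method = "radix")
print(sorted_c)  # "Cats" "Dogs" "cats" "dogs"

# Code points of the first letter of each string, in the original order of x.
first_codes <- vapply(x, function(s) utf8ToInt(substr(s, 1, 1)), integer(1))
print(first_codes)  # 100 68 99 67
```

The uppercase-initial strings come first because "C" (67) and "D" (68) have smaller code points than "c" (99) and "d" (100).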

C. Braun