
I usually use the arrange() function from dplyr to sort datasets, but it behaved in a way that I couldn't understand. Took me a little while to get to the bottom of this. I've fixed my code and used order() to do the same thing, but now I'm curious. I have used arrange() without thinking twice for ages, and I wonder why this seems to be the default behavior. It looks like it fails to sort alphabetically when capital letters are involved--as in, it believes capital letters should come prior to lowercase letters, even if the latter precede them in the alphabet. Am I missing something?

This is not always a problem, but it became one for me when I called tapply() immediately after sorting with arrange(), assuming that the data would be in the same order that tapply() uses for its output. Here's an example in which arrange() puts "USSR" before "Uganda" and "Ukraine", whereas order() (correctly, I think!) puts it last.

library(dplyr)
countries <- c("USSR", "Uganda", "Ukraine")
tmp <- data.frame(countries, stringsAsFactors = FALSE)
tmp %>% arrange(countries)   # orders it one way
tmp[order(tmp$countries), ]  # orders it another way
sort(tmp$countries)          # sort() agrees with order()

I looked around to see whether others had encountered this same problem, and couldn't see anything. Forgive me if this has been discussed previously.

daanoo
  • I wonder if this bit from ?arrange is relevant: "Note that for local data frames, the ordering is done in C++ code which does not have access to the local specific ordering usually done in R. This means that strings are ordered as if in the C locale." – atiretoo Sep 17 '15 at 21:57
  • See also the discussion on the `?Comparison` help page, specifically the paragraph describing essentially how `order()` works that begins "Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use." – MrFlick Sep 17 '15 at 22:01
  • Thanks. That is annoying! But revealing.. – daanoo Sep 17 '15 at 22:04
  • Possible duplicate of [Why does R 3.6.0 return FALSE when evaluating the expression ("Dogs" < "cats")?](https://stackoverflow.com/questions/56485774/why-does-r-3-6-0-return-false-when-evaluating-the-expression-dogs-cats) – divibisan Jun 07 '19 at 15:13
  • This question is newer, but has a much more detailed answer – divibisan Jun 07 '19 at 15:14
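
To make the "C locale" point from the comments concrete, here is a minimal base R sketch (not part of the original thread). It assumes only that the strings are plain ASCII; the variable names are made up for illustration. In the C locale, strings are compared character by character using their numeric codes, and every uppercase ASCII letter has a smaller code than every lowercase one.

```r
countries <- c("USSR", "Uganda", "Ukraine")

# All three names start with "U", so the second character decides the order.
second_chars <- substr(countries, 2, 2)

# Numeric character codes: "S" is 83, "g" is 103, "k" is 107.
code_points <- vapply(second_chars, utf8ToInt, integer(1))
print(code_points)

# Ordering by these codes puts "USSR" first, as a C-locale sort would.
# (This shortcut only works here because the first characters are identical.)
c_like_order <- countries[order(code_points)]
print(c_like_order)
```

Because "S" (83) is smaller than "g" (103), "USSR" lands ahead of "Uganda" under this kind of comparison, even though a dictionary would put it last.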

1 Answer

Yes, the comment from @MrFlick is correct. If I do

Sys.setlocale("LC_COLLATE","C")

then

tmp[order(tmp$countries),]

matches the result from arrange().
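
If you would rather not leave the session's collation changed, a rough sketch along these lines may help. It uses only base R; `old_collate`, `c_order`, and `radix_order` are names invented for this example. It saves the current setting, switches to "C" just for the sort, and restores it afterwards. As an alternative that never touches the locale, `sort()` and `order()` accept `method = "radix"`, which is documented to use C-locale ordering for character vectors.

```r
countries <- c("USSR", "Uganda", "Ukraine")

# Remember the current collation setting so it can be restored.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch to the C locale only for the duration of the sort.
Sys.setlocale("LC_COLLATE", "C")
c_order <- sort(countries)
Sys.setlocale("LC_COLLATE", old_collate)

# Alternative: radix sorting uses C-locale ordering for character vectors
# regardless of the session's LC_COLLATE setting.
radix_order <- sort(countries, method = "radix")

print(c_order)
print(identical(c_order, radix_order))
```

Both approaches should give the order that arrange() produced in the question; the radix version has the advantage of having no global side effects.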

atiretoo
  • But! There doesn't seem to be a way to easily change the locale for dplyr, at least on a Windows box. – atiretoo Sep 17 '15 at 22:23