
Short version: I load() data in a package. A test in the package used to pass, but now it fails because the output of sort() changed. Here is a minimal reproducible example (details below):

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"        
# NEW 4.0.0 [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital" 
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"     

# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"  

# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  

The only related reported change I found is

CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.

But I am not even sure whether this could explain what I see. Since I want to minimize the number of imported packages and would like to understand what is going on, I am not sure how to proceed. Am I missing something? (Switching to sort.int with method "radix" would do the job, but still: why did it change? Is that really better?)

I just realized (thanks to Roland) that in my case sort calls sort.int:

function (x, decreasing = FALSE, na.last = NA, ...) 
{
  if (is.object(x)) 
    x[order(x, na.last = na.last, decreasing = decreasing)]
  else sort.int(x, na.last = na.last, decreasing = decreasing, 
    ...)
}

From ?sort.int:

The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".

And according to the docs, sort.int did not change from 4.0.0 to 4.0.2.
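To make the method choice described above concrete, here is a minimal base-R sketch that pins the method explicitly instead of relying on "auto". `y` is the vector from the example at the top; the names `by_radix` and `by_shell` are only illustrative. With "radix" the result follows C-locale (byte) order; with "shell" it follows the session's collation, so that output is deliberately not shown.

```r
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")

# `method` is passed through sort()'s ... argument to sort.int()
by_radix <- sort(y, method = "radix")  # C-locale order, independent of the session locale
by_shell <- sort(y, method = "shell")  # depends on the session's LC_COLLATE setting

by_radix
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"

# Both results hold the same elements; only their order may differ
setequal(by_radix, by_shell)
# [1] TRUE
```

Which of the two is "right" depends on whether a test should follow the user's locale or be reproducible across machines; pinning the method at least makes that choice explicit.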

From ?data.table::setorder:

data.table always reorders in "C-locale". As a consequence, the ordering may be different to that obtained by base::order. In English locales, for example, sorting is case-sensitive in C-locale. Thus, sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but c("a", "B", "c") in base::order. Note this makes no difference in most cases of data; both return identical results on ids where only upper-case or lower-case letters are present ("AB123" < "AC234" is true in both), or on country names and other proper nouns which are consistently capitalized. For example, neither "America" < "Brazil" nor "america" < "brazil" are affected since the first letter is consistently capitalized.

Using C-locale makes the behaviour of sorting in data.table more consistent across sessions and locales. The behaviour of base::order depends on assumptions about the locale of the R session. In English locales, "america" < "BRAZIL" is true by default but false if you either type Sys.setlocale(locale="C") or the R session has been started in a C locale for you – which can happen on servers/services since the locale comes from the environment the R session was started in. By contrast, "america" < "BRAZIL" is always FALSE in data.table regardless of the way your R session was started.
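The locale dependence described in this excerpt can be reproduced in base R without data.table. The following is a sketch only: `sort_in_c_locale()` is a hypothetical helper name, not part of any package. It switches LC_COLLATE to "C" for the duration of one sort and restores the previous setting afterwards.

```r
# Hypothetical helper: sort a character vector using C-locale collation,
# then restore whatever collation the session had before.
sort_in_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x, method = "shell")  # "shell" honours the (now C) collation
}

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort_in_c_locale(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"

# In the C locale upper-case letters sort before lower-case ones,
# which matches the data.table behaviour quoted above
sort_in_c_locale(c("c", "a", "B"))
# [1] "B" "a" "c"
```

For plain ASCII input like this, the result should match `sort(y, method = "radix")`, which is the shorter way to get the same ordering.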

(Related questions: "Language dependent sorting with R" and "Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?")


Details

R.version # old
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.2                         
year           2018                        
month          12                          
day            20                          
svn rev        75870                       
language       R                           
version.string R version 3.5.2 (2018-12-20)
nickname       Eggshell Igloo 

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"   

# =======
R.version # new after upgrade
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          0.0                         
year           2020                        
month          04                          
day            24                          
svn rev        78286                       
language       R                           
version.string R version 4.0.0 (2020-04-24)
nickname       Arbor Day

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"   

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  

# ==== Test with new 4.0.2
R.version
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          0.2                         
year           2020                        
month          06                          
day            22                          
svn rev        78730                       
language       R                           
version.string R version 4.0.2 (2020-06-22)
nickname       Taking Off Again 

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital" 
Christoph
  • Perhaps `sort.int(y, method="radix")`? Documentation says `"radix"` is stable (not tested). – jay.sf Aug 06 '20 at 07:16
  • @jay.sf Method "shell" also does the job, see my edit. But still: what happened? Is that on purpose? Could that change again in the future? Our analyses need to be stable over time. Thanks for your help! – Christoph Aug 06 '20 at 07:29
  • Can you try with R 4.0.1 or R 4.0.2? There was this NEWS item: 'In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <- I(letters), was accidentally using method = "radix". ' With `German_German` locale I get the same sorting as you report for R 3.5.2 with your example. I have many old versions on my system but unfortunately I skipped installing R 4.0.0 and can't test with it. – Roland Aug 06 '20 at 07:44
  • "C-locale" means, among other things, case-sensitive. See `sort(c("A", "a"))` and `sort(c("A", "a"), method = "radix")`. – Hugh Aug 06 '20 at 08:37
  • @jay.sf See my edit. 4.0.2 changed it back... – Christoph Aug 06 '20 at 10:22
  • @Christoph Be careful that you don't conflate two different changes. The unintended changes in the behavior of `sort.int` are probably unrelated to the changed behavior of `load`. – Roland Aug 06 '20 at 10:24

1 Answer


In summary, as @Roland figured out, it was a bug that was fixed in R 4.0.1.
From CRAN:

In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <- I(letters), was accidentally using method = "radix". Consequently, e.g., merge(<data.frame>) was much slower than previously; reported in PR#17794.
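If a test suite has to run on a machine that may still have the affected release installed, one option is to detect it explicitly. This is only a sketch under the assumption, taken from the NEWS item above, that R 4.0.0 is the only affected version; `affected_by_sort_bug` is an illustrative name.

```r
# Assumption: only R 4.0.0 shows the accidental "radix" behaviour
affected_by_sort_bug <- getRversion() == "4.0.0"

if (affected_by_sort_bug) {
  warning("R 4.0.0 detected: the order of sorted strings may differ; consider upgrading.")
}
```

A test that must be stable regardless of version can instead pin the method, e.g. `sort(y, method = "radix")`, and compare against a C-locale reference result.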

Christoph