
I'm trying to get the day of the week, and have it work consistently in any locale. In locales with Latin alphabets, everything is fine.

Sys.getlocale()
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
weekdays(Sys.Date())
## [1] "Tuesday"

I have two related problems with other locales.

If I set

Sys.setlocale("LC_ALL", "Arabic_Qatar")
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=Arabic_Qatar.1256;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"

then I sometimes (correctly) get

weekdays(Sys.Date())
## [1] "الثلاثاء"

and sometimes get

weekdays(Sys.Date())
## [1] "ÇáËáÇËÇÁ"

depending upon my setup. The problem is, I can't figure out what is causing the difference.

I thought it might be something to do with getOption("encoding"), but I've tried explicitly setting options(encoding = "native.enc") and options(encoding = "UTF-8") and it makes no difference.

I've tried several recent versions of R, and the problem is consistent across all of them.

At the moment, the string displays correctly in R GUI, but incorrectly when I use an IDE (Architect and RStudio tested).

What should I set to ensure that weekdays always displays correctly?

It may be helpful to know that weekdays(Sys.Date()) is equivalent to format(as.POSIXlt(Sys.Date()), "%A"), which calls an internal format.POSIXlt method.
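
As a quick check of that equivalence, a minimal sketch along these lines should hold in whatever locale is active. The fixed date is just an illustrative value, chosen so the check is reproducible.

```r
# Illustrative fixed date, so the comparison does not depend on today's date
d <- as.Date("2014-10-28")

# weekdays() on a Date and the explicit "%A" format should give the same string
via_weekdays <- weekdays(d)
via_format <- format(as.POSIXlt(d), "%A")

identical(via_weekdays, via_format)
```

If the two ever differed, that would point at the Date method rather than at the locale handling.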

Secondly, it seems overkill to change all of the locale. I thought I should just be able to set the time options. However, if I set individual components of the locale, weekdays returns a string of question marks.

for(category in c("LC_TIME", "LC_CTYPE", "LC_COLLATE", "LC_MONETARY"))
{
  Sys.setlocale(category, "Arabic_Qatar")
  print(Sys.getlocale())
  print(weekdays(Sys.Date()))
}
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=Arabic_Qatar.1256;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"

What parts of the locale affect how the weekdays are printed?


Update: The problem seems to be Windows-related. When I run the code on a Linux box with locale "ar_QA.UTF8", the weekdays are correctly displayed.


Further update: As agstudy mentioned in his answer, setting locales under Windows is odd, since you can't just use ISO codes like "en-GB". For Windows 7/Vista/Server 2003/XP you can set a locale using setlocale language strings or National Language Support values. For Qatari Arabic, there is no setlocale language string, so we must use an NLS value. We have several choices:

Sys.setlocale("LC_TIME", "ARQ")    # the language abbreviation name
Sys.setlocale("LC_TIME", "Arabic_Qatar") # corresponding to the language/country pair "Arabic (Qatar)"
Sys.setlocale("LC_TIME", "Arabic_Qatar.1256") # explicitly including the ANSI codepage
Sys.setlocale("LC_TIME", "Arabic") # would sometimes be a possibility too, but it defaults to Saudi Arabic

So the problem isn't that R cannot support Arabic locales under Windows (though I'm not entirely convinced of the robustness of Sys.setlocale).
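
To see which of these names a given machine accepts, one option is to test them directly. The sketch below relies on Sys.setlocale returning an empty string (with a warning) when a request cannot be honoured; locale_available is a hypothetical helper name, not something from base R.

```r
# Hypothetical helper: TRUE if the OS accepts `locale` for the given category
locale_available <- function(locale, category = "LC_TIME") {
  old <- Sys.getlocale(category)
  # Restore the previous setting whatever happens
  on.exit(Sys.setlocale(category, old), add = TRUE)
  # Sys.setlocale() returns "" and warns when the locale is not available
  result <- suppressWarnings(Sys.setlocale(category, locale))
  nzchar(result)
}

# Try each candidate name in turn
candidates <- c("ARQ", "Arabic_Qatar", "Arabic_Qatar.1256", "Arabic", "ar_QA.UTF-8")
vapply(candidates, locale_available, logical(1))
```

This only says whether the name is accepted, not whether the resulting day names display correctly.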


Desperate last-ditch attempt: trying to magically fix things by using the Windows Management Instrumentation Command-line tool (wmic) to change the OS locale doesn't work, since R doesn't appear to recognise the changes.

system("wmic os set locale=MS_4001") 
## Updating property(s) of '\\PC402729\ROOT\CIMV2:Win32_OperatingSystem=@'
## Property(s) update successful.
system("wmic os get locale") # same as before
Richie Cotton

2 Answers


The system of naming locales is OS-specific. I recommend reading the section on locales in the R Installation and Administration manual for a complete explanation.

Under Windows:

The supported languages are listed on the MSDN Language Strings page, and surprisingly Arabic is not among them. The "Language string" column contains the legal inputs for setting the locale in R, and even the list of country/region strings contains no Arabic-speaking country.

Of course, you can change your global locale settings (Control Panel --> Region --> ...), but this changes the locale system-wide, and it is not guaranteed to give the right output without encoding problems.

Under Linux (Ubuntu in my case):

Arabic is generally not installed by default, but it is easy to add from the command line.

 locale -a                     ## to list all already supported language
 sudo locale-gen ar_QA.UTF-8   ## install it in case does not exist

Now, in RStudio:

 Sys.setlocale('LC_TIME','ar_QA.UTF-8')
[1] "ar_QA.UTF-8"

> format(Sys.Date(),'%A')
[1] "الثلاثاء"

Note also that in the plain R console the output is not rendered as nicely as in RStudio, because it is written left to right rather than right to left.

agstudy
  • Thanks for the links. I've read a bit and tested a bit, and Arabic is (at least in theory) supported, so I think the problem must lie elsewhere. Of course, the problem is that I'm not sure where that else is. – Richie Cotton Oct 28 '14 at 11:57

The RStudio/Architect problem

This can be solved, slightly messily, by explicitly changing the encoding of the weekdays string to UTF-8.

current_codepage <- as.character(l10n_info()$codepage)
iconv(weekdays(Sys.Date()), from = current_codepage, to = "utf8")

Note that codepages only exist on Windows; l10n_info()$codepage is NULL on Linux.
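
Because of that, the conversion needs a guard if the same code is to run on both platforms. A minimal sketch, assuming that falling back to enc2utf8 is acceptable when no codepage is reported or the session is already UTF-8; to_utf8 is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: convert a native-encoded string to UTF-8
to_utf8 <- function(x) {
  info <- l10n_info()
  # Already in a UTF-8 locale, or no codepage reported (e.g. Linux):
  # let R translate from the native encoding itself
  if (isTRUE(info[["UTF-8"]]) || is.null(info[["codepage"]])) {
    return(enc2utf8(x))
  }
  # Windows with a non-UTF-8 codepage: convert explicitly, as above
  iconv(x, from = as.character(info[["codepage"]]), to = "UTF-8")
}

to_utf8(weekdays(Sys.Date()))
```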

The LC_TIME problem

It turns out that under Windows you have to set both the LC_CTYPE and LC_TIME locale categories, and you have to set LC_CTYPE before LC_TIME, or it won't work.


In the end, we need different implementations for different OSes.

Windows version:

get_today_windows <- function(locale = NULL)
{
  if(!is.null(locale))
  {
    lc_ctype <- Sys.getlocale("LC_CTYPE")
    lc_time <- Sys.getlocale("LC_TIME")
    on.exit(Sys.setlocale("LC_CTYPE", lc_ctype))
    on.exit(Sys.setlocale("LC_TIME", lc_time), add = TRUE)
    Sys.setlocale("LC_CTYPE", locale)
    Sys.setlocale("LC_TIME", locale)
  }
  today <- weekdays(Sys.Date())
  current_codepage <- as.character(l10n_info()$codepage)
  iconv(today, from = current_codepage, to = "utf8")
}
get_today_windows() 
## [1] "Tuesday"
get_today_windows("French_France")
## [1] "mardi"
get_today_windows("Arabic_Qatar")
## [1] "الثلاثاء"
get_today_windows("Serbian (Cyrillic)") 
## [1] "уторак"
get_today_windows("Chinese (Traditional)_Taiwan") 
## [1] "星期二"

Linux version:

get_today_linux <- function(locale = NULL)
{
  if(!is.null(locale))
  {
    lc_time <- Sys.getlocale("LC_TIME")
    on.exit(Sys.setlocale("LC_TIME", lc_time), add = TRUE)
    Sys.setlocale("LC_TIME", locale)
  }
  weekdays(Sys.Date())
}
get_today_linux() 
## [1] "Tuesday"
get_today_linux("fr_FR.utf8")
## [1] "mardi"
get_today_linux("ar_QA.utf8")
## [1] "الثلاثاء"
get_today_linux("sr_RS.utf8") 
## [1] "уторак"
get_today_linux("zh_TW.utf8") 
## [1] "週二"

Enforcing the .utf8 encoding in the locale seems important: get_today_linux("zh_TW") doesn't display properly.
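
For code that has to run on both platforms, the two versions could be folded into one function that branches on .Platform$OS.type. This is only a sketch of that idea, not something I have tested widely; get_today and its date argument are hypothetical additions, and the locale names still have to be the OS-specific ones shown above.

```r
# Hypothetical combined version of get_today_windows() and get_today_linux()
get_today <- function(locale = NULL, date = Sys.Date()) {
  is_windows <- .Platform$OS.type == "windows"
  if (!is.null(locale)) {
    # Windows needs LC_CTYPE set before LC_TIME; elsewhere LC_TIME is enough
    categories <- if (is_windows) c("LC_CTYPE", "LC_TIME") else "LC_TIME"
    old <- vapply(categories, Sys.getlocale, character(1))
    # Restore the original settings, in the same order, on exit
    on.exit(
      for (category in names(old)) Sys.setlocale(category, old[[category]]),
      add = TRUE
    )
    for (category in categories) Sys.setlocale(category, locale)
  }
  day <- weekdays(date)
  if (is_windows) {
    # Re-encode from the current ANSI codepage so IDEs display it correctly
    day <- iconv(day, from = as.character(l10n_info()$codepage), to = "UTF-8")
  }
  day
}

get_today()
```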

Richie Cotton
  • Sharp-eyed readers may notice the difference in the Chinese day names. Is one of these wrong? Or are they just different scripts? – Richie Cotton Oct 28 '14 at 13:09