I'm having trouble with R (on OS X).

I have a set of files with German names, and I see strange behavior in this example (the first 'Käse' was typed on the keyboard, the second was copied from list.files() output):

names <- c('Käse', 'Käse')
grepl('Käse', names)

# [1] TRUE FALSE

After a lot of head-scratching I noticed in the console that the umlauts were displayed slightly differently.

Finally I found that:

iconv(names,'latin1','ascii','bytes')

# [1] "K<c3><a4>se"  "Ka<cc><88>se"

This was especially surprising, as I thought of ä as a single one-byte character (code 132 in extended ASCII, i.e. code page 437; 7-bit ASCII itself has no ä).

I also noticed that when I type (from the keyboard)

system('touch käse2')

the file name is automatically converted to the second form.

So my question is: how can I configure R so that the umlauts I type in regular expressions match those used in file names?

The output of Sys.getlocale:

> Sys.getlocale()
[1] "de_AT.UTF-8/de_AT.UTF-8/de_AT.UTF-8/C/de_AT.UTF-8/de_AT.UTF-8"

Update

The behavior that bothers me the most is the following:

filename <- 'Käse.Rdata'
save(file=filename)
list.files(pattern=filename)
# character(0)

So the file name is not equal to the string that was used to create it.

Hmm, this seems to be Mac-specific; on my Windows machine it works as expected.

bdecaf

1 Answer

"K<c3><a4>se" encodes the "ä" as the single Unicode character U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).

"Ka<cc><88>se" encodes the "ä" as two Unicode characters: U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING DIAERESIS).
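
The two spellings can be inspected directly in base R. This is only a small sketch: the variable names `composed` and `decomposed` are made up for illustration, and the strings are written with `\u` escapes so the result does not depend on how the editor or terminal stores the umlaut.

```r
# Two spellings of "Käse", written with explicit escapes
composed   <- "K\u00E4se"   # precomposed: U+00E4
decomposed <- "Ka\u0308se"  # decomposed: U+0061 followed by U+0308

# Code points of each string
utf8ToInt(composed)
# [1]  75 228 115 101
utf8ToInt(decomposed)
# [1]  75  97 776 115 101

# The strings look the same on screen but are not identical
identical(composed, decomposed)
# [1] FALSE
```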

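Both are valid and canonically equivalent, but they are different code point sequences, so grepl and list.files treat them as distinct. The first is the precomposed form (NFC), the second the decomposed form (NFD); OS X file systems typically store file names in a decomposed form, which is why names returned by list.files() differ from typed text. To compare them, you need to normalize the strings to one form first. You could use the package stringi: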

stringi::stri_trans_nfc("Ka\u0308se") == "K\u00E4se"
# [1] TRUE
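
Applied to the matching problem from the question, one option is to normalize both the pattern and the file names before comparing. The following is a minimal sketch, assuming stringi is installed. The vector `file_names` stands in for list.files() output, `list_files_nfc` is a hypothetical helper name, and `fixed = TRUE` is my own choice so that the text is matched literally instead of as a regular expression.

```r
library(stringi)

typed      <- "K\u00E4se"                   # as typed on the keyboard (precomposed)
file_names <- c("K\u00E4se", "Ka\u0308se")  # stand-in for list.files() output

# Without normalization only the precomposed name matches
grepl(typed, file_names, fixed = TRUE)
# [1]  TRUE FALSE

# Normalize both sides to NFC before comparing
grepl(stri_trans_nfc(typed), stri_trans_nfc(file_names), fixed = TRUE)
# [1] TRUE TRUE

# Hypothetical helper: return the files in `path` whose NFC-normalized
# name contains the NFC-normalized literal `text`
list_files_nfc <- function(text, path = ".") {
  files <- list.files(path)
  files[grepl(stri_trans_nfc(text), stri_trans_nfc(files), fixed = TRUE)]
}
```

With this, `list_files_nfc("Käse.Rdata")` should find the file regardless of which form the file system stored. The returned names are the original, unnormalized ones, so they can be passed straight to load() or file.remove().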

Markus Jarderot
  • Arguably, perhaps R should be doing this behind the scenes, at least for some string operations. – tripleee Nov 10 '15 at 07:59
  • I suppose I will use this. But one of the annoyances is that when a file is created the encoding somehow is converted. – bdecaf Nov 10 '15 at 08:40