resolving OS vs list.files() handling of accented characters

Question

I'm correcting some homework from students who have accented characters in their names (not my home locale), and foolishly decided to honor the actual spelling of their names when creating files with my comments (Firstname.Lastname). In general, I created the file names in the console or within Emacs by using the Compose key (e.g. compose-'-a to generate á). This has given rise to the following mismatch between the OS and R's list.files():

system("touch testá")      ## create file with accented character in name
list.files(pattern="test") ## it's there ...
## [1] "testá"

But when I try to match the whole word in the pattern argument ...

list.files(pattern="testá")
## character(0)

This is on Xubuntu 16.04, but it's a virtual machine, so the underlying file system is HFS+. My normal locale is

[1] "LC_CTYPE=en_CA.UTF8;LC_NUMERIC=C;LC_TIME=en_CA.UTF8;LC_COLLATE=en_CA.UTF8;LC_MONETARY=en_CA.UTF8;LC_MESSAGES=en_CA.UTF8;LC_PAPER=en_CA.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF8;LC_IDENTIFICATION=C"

but switching it via Sys.setlocale("LC_ALL","pl_PL.UTF8") (which apparently succeeds) doesn't help.

What's really weird (to me) is that doing the same exercise with "testł" does work ...

As suggested in comments, I explored a bit more with charToRaw. There is in fact a difference between the string representation in R and the name as stored on disk:

charToRaw("testá")
## [1] 74 65 73 74 c3 a1
charToRaw(list.files(pattern="test"))
## [1] 74 65 73 74 61 cc 81

I have tried this on Ubuntu 14.04 and I can match the whole word. Are the byte sequences of the strings the same in your example? Mine are: `charToRaw("touch testá")` gives `74 6f 75 63 68 20 74 65 73 74 c3 a1` and `charToRaw("testá")` gives `74 65 73 74 c3 a1`. `Encoding(...)` shows `UTF-8` for both strings. - R Yoda, May 06 '18 at 21:57

IRTFM · Answer 1 · 2018-05-06T22:40:40.700

I'm on a Mac and get the same result as you. I tried giving a pattern of "test\u00e1", which is what as.hexmode(utf8ToInt("á")) gave me as the code point (a Unicode code point, not an ASCII value).

In the end I suggest brute force for the problem at hand: rename the file to a plain ASCII name.

> file.rename("testá", "testXXX")
[1] TRUE
> list.files(pattern="testXXX")
[1] "testXXX"

Like R Yoda, I first looked at charToRaw, but pasting its bytes (c3 a1) into a \u escape gave an incorrect translation, because \u expects a code point rather than UTF-8 bytes:

> "\u00e1"
[1] "á"
> "test\uc3a1"
[1] "test쎡"
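
To make the distinction between code points and UTF-8 bytes concrete, here is a small base-R sketch (an illustration added for clarity, not part of the original exchange). It assumes only that the strings are written with \u escapes, so the result does not depend on how the script file itself is encoded.

```r
# Precomposed "á" is a single code point, U+00E1, stored as two UTF-8 bytes.
precomposed <- "\u00e1"
utf8ToInt(precomposed)        # the code point as an integer (0xE1)
charToRaw(precomposed)        # UTF-8 bytes: c3 a1

# The name returned by list.files() in the question used "a" followed by
# the combining acute accent U+0301: two code points, three bytes.
decomposed <- "a\u0301"
utf8ToInt(decomposed)         # two code points (0x61 and 0x301)
charToRaw(decomposed)         # UTF-8 bytes: 61 cc 81

# To rebuild a string from raw bytes, use rawToChar() rather than \u.
from_bytes <- rawToChar(as.raw(c(0xc3, 0xa1)))
Encoding(from_bytes) <- "UTF-8"
identical(utf8ToInt(from_bytes), utf8ToInt(precomposed))
```

Both strings render as "á", but they are different sequences of code points, which is why a byte-wise pattern match can fail.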

score 1 · Accepted Answer · answered May 07 '18 at 02:11

Thanks to clues from @42- and @RYoda: since my underlying file system is HFS+, I was able to find this blog post on "HFS+ and utf8 accented characters", which led me to this SO question & answer on Unicode normalization, which leads to the solution

list.files(pattern=stringi::stri_trans_nfd("testá"))

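where ?stri_trans_nfd lets us know that "nfd" stands for

• NFD (Canonical Decomposition),

which matches the decomposed bytes (61 cc 81) that list.files() returned for the name stored on disk.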
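
A possible generalization, sketched under assumptions: rather than converting only the pattern to NFD, normalize both the file names and the target name to one common form before comparing, so the match works whichever form the file system hands back. The helper name match_normalized is hypothetical; stri_trans_nfc and stri_trans_nfd come from the stringi package, which must be installed.

```r
library(stringi)

# Hypothetical helper: return the elements of `files` whose name equals
# `name` once both are brought to NFC (canonical composition).
match_normalized <- function(name, files = list.files()) {
  files[stri_trans_nfc(files) == stri_trans_nfc(name)]
}

nfc_name <- "test\u00e1"      # precomposed, as typed in R
nfd_name <- "testa\u0301"     # decomposed, as returned by the file system

# The raw bytes differ, so a byte-wise match fails ...
identical(charToRaw(nfc_name), charToRaw(nfd_name))

# ... but the two names agree after normalization.
stri_trans_nfd(nfc_name) == nfd_name
match_normalized(nfc_name, files = c(nfd_name, "other.txt"))
```

Because the pattern argument of list.files() is a regular expression, comparing whole names this way also avoids having to escape special characters in the name.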
