tolower throws error on line even though it's read as UTF-8

Question

When turning some lines into lower case, R throws an error that I didn't expect.

Error in tolower(readLines(x, encoding = "UTF-8")) : 
  invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Fijne dag en take care ðŸ€</sentence>' in 'utf8towcs'

ðŸ€ is the culprit. However, why does this happen? I figured this was an encoding problem, but my readLines function clearly states that the encoding has to be UTF-8. What's going on?

Example data for x:

/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Take care !</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Take care meisje X</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Hele fijne dag en take care ☀⛄</sentence>

I am aware of solutions (I found this one works best) but I want to know why the encoding doesn't work correctly. What goes wrong?

No explanation, but it's a common issue that many work around by wrapping the call in `tryCatch` (you'll find plenty of examples with your favorite web search engine). You could also use stringi's counterpart to tolower: `x <- "\ud83d\udc4d\ud83d\ude09"; stringi::stri_trans_tolower(x); tolower(x)`. — lukeA, Aug 24 '15 at 22:57
Also have no explanation. When I used `tolower` on the string printed for the error message it gave `ðÿ€`. Not sure what was intended with the example since it did not match either the error message or the text of your question. I got `⛄` turned into `\u26c4`. Using Mac OSX 10.7.5 with R 3.2.1 with a US locale. You have not specified your locale. — IRTFM, Aug 24 '15 at 23:00
@lukeA I suppose stringi handles `x` as UTF-8 by default? Additionally: would `stri_trans_tolower` be faster than the built-in `tolower` function? — Bram Vanroy, Aug 24 '15 at 23:01
@BramVanroy I guess stringi assumes `stri_enc_mark(x)` by default. stringi functions are said to be fast. You can benchmark using `microbenchmark::microbenchmark()`. — lukeA, Aug 25 '15 at 07:47
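
The two workarounds mentioned in the comments can be combined in a minimal sketch. It assumes stringi may or may not be installed; `safe_tolower` is a hypothetical helper name, and the thumbs-up emoji is only a stand-in for the unreadable character in the question. It does not explain the error, it only avoids aborting on it.

```r
# Sketch only: lower-case a character vector without stopping on lines
# that base tolower() rejects. safe_tolower is a hypothetical name.
safe_tolower <- function(lines) {
  if (requireNamespace("stringi", quietly = TRUE)) {
    # stringi's counterpart to tolower(), as suggested in the comments
    return(stringi::stri_trans_tolower(lines))
  }
  # Fallback: try base tolower() per line; on error keep the line unchanged
  vapply(
    lines,
    function(l) tryCatch(tolower(l), error = function(e) l),
    character(1),
    USE.NAMES = FALSE
  )
}

# Stand-in data: the second element ends in an emoji outside the BMP
x <- c("Take care !", "Fijne dag en take care \U0001F44D")
result <- safe_tolower(x)
print(result)
```

To compare the speed of the two approaches on real data, `microbenchmark::microbenchmark()` can be wrapped around both calls, as suggested above.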
