When turning some lines into lower case, R throws an error that I didn't expect.

Error in tolower(readLines(x, encoding = "UTF-8")) : 
  invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Fijne dag en take care ðŸ€</sentence>' in 'utf8towcs'

`ðŸ€` is the culprit. However, why does this happen? I figured this was an encoding problem, but my `readLines` call explicitly sets the encoding to UTF-8. What's going on?

Example data for x:

/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Take care !</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Take care meisje X</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:  <sentence>Hele fijne dag en take care ☀⛄</sentence>

I am aware of solutions (I found this one works best) but I want to know why the encoding doesn't work correctly. What goes wrong?
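
One way to narrow this down is to check whether the lines are actually valid UTF-8. According to the R documentation, the `encoding` argument of `readLines` only marks the strings as UTF-8 and does not re-encode or validate the input. The following is a minimal sketch, assuming R 3.3.0 or later (for `validUTF8`). The helper `find_invalid_utf8` and the sample byte sequence are hypothetical and only illustrate the idea; they are not taken from the actual file.

```r
# Hypothetical helper: indices of elements that are not valid UTF-8.
find_invalid_utf8 <- function(lines) {
  which(!validUTF8(lines))
}

# One clean line and one line containing an incomplete multi-byte
# sequence (bytes F0 9F 80 followed by a space), built from raw bytes.
good <- "Take care !"
bad  <- rawToChar(as.raw(c(0x54, 0x61, 0x6b, 0x65, 0x20,
                           0xf0, 0x9f, 0x80, 0x20, 0x58)))
Encoding(bad) <- "UTF-8"  # marking a string does not validate its bytes

lines <- c(good, bad)
invalid_idx <- find_invalid_utf8(lines)

# Replace undecodable bytes with <xx> escapes so tolower() can proceed.
cleaned <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "byte")
lowered <- tolower(cleaned)
```

If `invalid_idx` is non-empty for the real file, the problem would be in the bytes on disk rather than in `tolower` itself.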

Bram Vanroy
  • No explanation; however, it's a common issue that many approach by incorporating a `tryCatch` (you'll find plenty of examples when using your favorite web search engine). You could also use stringi's counterpart to `tolower`: `x <- "\ud83d\udc4d\ud83d\ude09"; stringi::stri_trans_tolower(x); tolower(x)`. – lukeA Aug 24 '15 at 22:57
  • Also have no explanation. When I used `tolower` on the string printed for the error message it gave `ðÿ€`. Not sure what was intended with the example since it did not match either the error message or the text of your question. I got `⛄` turned into `\u26c4`. Using Mac OSX 10.7.5 with R 3.2.1 with a US locale. You have not specified your locale. – IRTFM Aug 24 '15 at 23:00
  • @lukeA I suppose stringi handles `x` as UTF-8 by default? Additionally: would `stri_trans_tolower` be faster than the built-in `tolower` function? – Bram Vanroy Aug 24 '15 at 23:01
  • @BramVanroy I guess stringi assumes `stri_enc_mark(x)` by default. stringi functions are said to be fast. You can benchmark using `microbenchmark::microbenchmark()`. – lukeA Aug 25 '15 at 07:47
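
The workarounds mentioned in the comments can be sketched roughly as follows. This is only an illustration, not a definitive fix: `safe_tolower` is a hypothetical wrapper name, the sample vector is made up, and the stringi and microbenchmark parts run only if those packages happen to be installed.

```r
# Hypothetical wrapper: lower-case each element separately and
# return NA for any element on which tolower() raises an error.
safe_tolower <- function(x) {
  vapply(
    x,
    function(s) tryCatch(tolower(s), error = function(e) NA_character_),
    character(1),
    USE.NAMES = FALSE
  )
}

x <- c("Take care !", "Hele FIJNE dag")
result <- safe_tolower(x)

# Optional: the stringi alternative and a timing comparison,
# skipped when the packages are not available.
if (requireNamespace("stringi", quietly = TRUE)) {
  result_stringi <- stringi::stri_trans_tolower(x)

  if (requireNamespace("microbenchmark", quietly = TRUE)) {
    timings <- microbenchmark::microbenchmark(
      base    = tolower(x),
      stringi = stringi::stri_trans_tolower(x),
      times   = 100L
    )
    print(timings)
  }
}
```

Wrapping each element separately keeps one bad line from aborting the whole vector, at the cost of losing `tolower`'s vectorised speed.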
