When converting some lines to lower case, R throws an error that I didn't expect:
Error in tolower(readLines(x, encoding = "UTF-8")) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: <sentence>Fijne dag en take care ðŸ€</sentence>' in 'utf8towcs'
ðŸ€ is the culprit. However, why does this happen? I assumed this was an encoding problem, but my readLines call explicitly sets the encoding to UTF-8. What's going on?
Example data for x:
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: <sentence>Take care !</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: <sentence>Take care meisje X</sentence>
/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: <sentence>Hele fijne dag en take care ☀⛄</sentence>
I am aware of workarounds (I found this one works best), but I want to know why specifying the encoding doesn't prevent the error. What goes wrong?
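For reference, here is a minimal sketch of how I would try to isolate the offending lines. The sample strings are made up, and the bytes `\xf0\x9f` stand in for a truncated 4-byte emoji, which is only my assumption about what the file actually contains. It relies on `validUTF8()` from base R (available since R 3.3.0).

```r
# Hypothetical sample: the second string ends in an incomplete UTF-8 sequence
# (first two bytes of a 4-byte emoji), an assumption about the real data.
lines <- c("Take care !", "take care \xf0\x9f")

# Declaring the encoding only marks the strings; it does not validate
# or convert the underlying bytes.
Encoding(lines) <- "UTF-8"

# Check which elements are well-formed UTF-8.
ok <- validUTF8(lines)

# Capture the error (if any) instead of stopping the script.
msg <- tryCatch(tolower(lines), error = function(e) conditionMessage(e))

# Lower-casing only the valid elements avoids the invalid input.
lowered <- tolower(lines[ok])
```

If `ok` is FALSE for the problematic line, the file contains bytes that are not valid UTF-8, regardless of what encoding was declared when reading it.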