Trouble with casefold() due to Non-English letters

Question

All I want to do is convert the address column in df to upper case:

df$address <- casefold(df$address, upper = TRUE)

but I keep getting the following error, probably because of the accented 'Í':

Error in toupper(x) : 
  invalid input 'POLÍGONO INDUSTRIAL OLASO' in 'utf8towcs'

I know this observation is already upper case, but not all of them are. I don't want to simply replace these characters with their English counterparts, mainly because an Eszett (ß) shows up later and I don't know what it would be replaced with.

Possible duplicate of [R tm package invalid input in 'utf8towcs'](https://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs) — CT Hall, Oct 11 '18 at 21:14

CT Hall · Accepted Answer · 2018-10-11T21:18:44.077

casefold() works as expected with the accented Í on my machine.

> casefold('POLÍGONO INDUSTRIAL OLASO')
[1] "polígono industrial olaso"
> casefold('POLÍGONO INDUSTRIAL OLASO', upper = TRUE)
[1] "POLÍGONO INDUSTRIAL OLASO"
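
Since the call succeeds on a freshly typed string, the error in the question may come from how the column is encoded rather than from casefold() itself: "invalid input in 'utf8towcs'" usually means R was handed bytes that are not valid UTF-8. Below is a minimal base R sketch of checking and repairing that. It assumes the data were read in as Latin-1, which is a guess and not something the question states, and the sample string is made up for illustration.

```r
# Hypothetical value: the address stored as Latin-1 bytes
# (\xcd is the Latin-1 byte for the accented capital I)
x <- "POL\xcdGONO INDUSTRIAL OLASO"

# Check whether the bytes form valid UTF-8 (they do not here)
validUTF8(x)

# Re-encode, ASSUMING the source encoding really is Latin-1
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(x_utf8)

# Case conversion on the re-encoded string
casefold(x_utf8, upper = TRUE)
```

If the column came from a file, declaring the encoding at read time (for example fileEncoding = "latin1" in read.csv) may avoid the problem altogether.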

The Eszett (ß) is left as is.

> casefold('daß')
[1] "daß"
> casefold('daß', upper = T)
[1] "DAß"

You may want to check out the stringr package, whose str_to_upper() translates the Eszett to SS.

> library(stringr)
> str_to_lower('daß')
[1] "daß"
> str_to_upper('daß')
[1] "DASS"

But the conversion doesn't work the other way around: lowercasing SS does not restore the ß.

> str_to_lower('DASS')
[1] "dass"
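
Applied to the original problem, the stringr route could look like the sketch below. The data frame is a made-up stand-in for df, and the sketch assumes the column is already valid UTF-8.

```r
library(stringr)

# Hypothetical stand-in for the asker's data frame
df <- data.frame(
  address = c("Polígono Industrial Olaso", "Straße 1"),
  stringsAsFactors = FALSE
)

# str_to_upper() is vectorised, so it can be applied to the whole column
df$address <- str_to_upper(df$address)
df$address
```

Rows that are already upper case pass through unchanged, so there is no need to filter them out first.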
