R- delete accents in string

Question

I have a folder of HTML files, and files_dep holds the list of their paths. I need to convert the text stored in them into a table, but the text contains accents and ñ. I wrote this to read the files, and it works fine:

for (i in files_dep) {
  text <- readLines(i, encoding = "UTF-8")
  aa <- paste(text, collapse = " ")
  # keep only the text between the start (empieza) and end (termina) markers
  if (grepl(empieza, aa) && grepl(termina, aa)) {
    nota <- gsub(paste0("(^.*", empieza, ")(.*?)(", termina, ".*)$"), "\\2", aa)
    # nota <- iconv(nota, to = "ASCII//TRANSLIT")
    df <- rbind(df, data.frame(fileName = i, nota = nota))
  }
}

I can read things like:

Este sábado enfrentarán a un equipo.

So I only need to remove the accents. I tried uncommenting this line:

nota <- iconv(nota,to="ASCII//TRANSLIT")

but I get:

 Este sA!bado se enfrentarA!n a un equipo.

So, I don't know what the problem is.

I also need to remove all other special characters, not just the accents. Thanks.

Edit:

I took the last value stored in nota at the end of the loop. This is what I see:

nota
[1] "                         <p>La inclusión del seleccionado argentino en el viejo Tres Naciones significó, hace tres años, la confirmación de que el nivel del rugby argentino estaba a la altura de los grandes equipos del planeta, aunque se preveía que esa transición entre ser un equipo <em>del montón</em>&nbsp;a formar parte de la<em> elite </em>no iba a ser sencilla<em>. </em>Hoy, luego de dos años de competencia en el Rugby Championship, Los Pumas están cada vez más cerca de dar el batacazo y conseguir su primer triunfo en la historia del torneo.</p><p>

If I do:

iconv(nota,to="ASCII//TRANSLIT")

I get:

iconv(nota,to="ASCII//TRANSLIT")
[1] "                         <p>La inclusiA3n del seleccionado argentino en el viejo Tres Naciones significA3, hace tres aA?os, la confirmaciA3n de que el nivel del rugby argentino estaba a la altura de los grandes equipos del planeta, aunque se preveA-a que esa transiciA3n entre ser un equipo <em>del montA3n</em>&nbsp;a formar parte de la<em> elite </em>no iba a ser sencilla<em>. </em>Hoy, luego de dos aA?os de competencia en el Rugby Championship, Los Pumas estA!n cada vez mA!s cerca de dar el batacazo y conseguir su primer triunfo en la historia del torneo.

What OS and R version are you using? When I run `nota<-"Este sábado enfrentarán a un equipo."; iconv(nota, to="ASCII//TRANSLIT")`, I get `"Este sabado enfrentaran a un equipo."` running R 3.1.1 on Windows. — MrFlick, Oct 15 '14 at 22:54
@MrFlick - it probably has to do with locale too. The above code works the same for me, but I'm in an "English_United States" locale as per `Sys.getlocale()` — thelatemail, Oct 15 '14 at 22:59
@thelatemail I get > Sys.getlocale() [1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252" — GabyLP, Oct 15 '14 at 23:02
@thelatemail In the case of `iconv`, it should only be affected by `Encoding(nota)`, but you are right in that the default locale may affect the encoding, but if you're using `readLines()` with `encoding="UTF-8"` that should keep everything as UTF-8. — MrFlick, Oct 15 '14 at 23:05
@MrFlick, if I do that I also get the right result; the problem is inside the loop. I don't know why. > iconv("este sábado" ,to="ASCII//TRANSLIT") [1] "este sabado" — GabyLP, Oct 15 '14 at 23:10
@GabyLP Are you sure the encoding of the file is UTF-8? When you print out the string, is that what it looks like in R or some other editor? Maybe your file is really "latin1" encoding. Try `readLines(i,encoding="latin1")`. Otherwise, please try to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). If my example works, then what's different about your data? — MrFlick, Oct 15 '14 at 23:14
OK, never mind, I solved it with chartr: for (r in 1:nrow(df)) {df[r,3]<-chartr("áéíóú", "aeiou",df[r,2])} — GabyLP, Oct 15 '14 at 23:59
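
The garbled output in the question (`sA!bado`, `inclusiA3n`, `aA?os`) is what you would expect if the UTF-8 bytes of each accented letter were read as two single-byte characters before transliteration. This is one plausible reading of the thread, not something the asker confirmed. As far as `?iconv` describes it, `iconv()` ignores the encoding mark on its input and, when `from` is not given, assumes the native encoding of the session (Windows-1252 in the asker's locale). A minimal sketch, assuming the files really are UTF-8, is to state `from` explicitly. The exact result of `ASCII//TRANSLIT` depends on the platform's iconv implementation, so no output is shown for that call.

```r
# A UTF-8 string, written with escapes so the example does not depend on
# the encoding of the script file itself.
nota <- "Este s\u00e1bado enfrentar\u00e1n a un equipo."

# Name the source encoding instead of relying on the session default.
limpio <- iconv(nota, from = "UTF-8", to = "ASCII//TRANSLIT")
print(limpio)

# Reproduce the symptom: interpret the same UTF-8 bytes as Latin-1.
# Each accented letter becomes two characters ("\u00c3" plus a symbol),
# which transliteration would then turn into something like "A!".
roto <- iconv(nota, from = "latin1", to = "UTF-8")
print(roto)
```

If the second result looks like the text you get inside the loop, the source encoding passed to `iconv()` is the likely culprit rather than the loop itself.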

score 32 · Accepted Answer · edited Aug 31 '16 at 14:23

When I faced a similar problem, I used the function `stri_trans_general` from the stringi package. For example, you can try: `stri_trans_general(nota, "Latin-ASCII")`
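
A short sketch of that call on a sample string follows. The sample text and the variable names are made up for illustration, and the second step (dropping everything that is not a letter, digit, or whitespace) is an added assumption about what "all special characters" means in the question, not part of this answer.

```r
library(stringi)  # install.packages("stringi") if it is not installed

# Sample text with accents, an enye, and punctuation (illustrative only).
nota <- "Este s\u00e1bado enfrentar\u00e1n; hace tres a\u00f1os."

# Step 1: transliterate accented Latin letters to plain ASCII.
sin_acentos <- stri_trans_general(nota, "Latin-ASCII")
print(sin_acentos)   # "Este sabado enfrentaran; hace tres anos."

# Step 2 (assumption): remove anything that is not a letter, digit, or space.
solo_texto <- stri_replace_all_regex(sin_acentos, "[^\\p{L}\\p{N}\\s]", "")
print(solo_texto)    # "Este sabado enfrentaran hace tres anos"
```

Both functions are vectorized, so they can be applied to a whole column such as `df$nota` without a loop.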

answered Feb 28 '16 at 15:30 by José · edited Aug 31 '16 at 14:23 by Daniel Falbel

score 1 · Answer 2 · answered Feb 18 '19 at 17:20

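I use this function, which replaces accented characters with their unaccented equivalents (all of them by default, or only the accent types listed in pattern):

rm_accent <- function(str, pattern = "all") {
  # Coerce non-character input (e.g. factors) to character
  if (!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  # Treat an upper-case cedilla in pattern as the lower-case one
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"

  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  # Unaccented replacements, position by position
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  # Values accepted in pattern, in the same order as the groups above
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")

  # Option: remove every accent type
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))

  # Otherwise remove only the accent types listed in pattern
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)

  return(str)
}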
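
The function above is built on `chartr`, which is vectorized over its input, so the same idea can be applied to an entire data frame column in one call instead of row by row. The sketch below uses `chartr` directly on a small made-up data frame; the column name `nota_limpia` and the sample rows are hypothetical.

```r
# Made-up data standing in for the table built in the question.
df <- data.frame(
  fileName = c("a.html", "b.html"),
  nota = c("Este s\u00e1bado", "hace tres a\u00f1os"),
  stringsAsFactors = FALSE
)

# Accented characters and their replacements, matched position by position.
# The two strings must contain the same number of characters.
con_acento <- "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1"   # a, e, i, o, u with acute accent, and enye
sin_acento <- "aeioun"

# One vectorized call cleans the whole column.
df$nota_limpia <- chartr(con_acento, sin_acento, df$nota)
print(df$nota_limpia)   # "Este sabado" "hace tres anos"
```

The same pattern works with `rm_accent(df$nota)`, since it passes the whole vector through to `chartr`.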
