I need your help because I keep getting the same error whichever approach I try. I want to replace special characters in a data frame: "áéíóúÁÉÍÓÚýÝ", "àèìòùÀÈÌÒÙ", "âêîôûÂÊÎÔÛ", "ãõÃÕñÑ", "äëïöüÄËÏÖÜÿ", "çÇ" should become "aeiouAEIOUXX", "aeiouAEIOU", "AEIOUAEIOU", "AOAOXX", "AEIOUAEIOUX", "XX", respectively. Thank you!

First I tried doing this:

library(dplyr)    # provides %>%
library(stringr)  # str_to_upper(), str_trim(), str_replace_all()

trata<-function(Campo){
  Campo<-Campo %>% chartr('ÇÆ£ØÞß&@Ð','XXXXXXXXX',.) %>%
    str_to_upper(locale = "es") %>% str_trim(side = "both") %>%
    str_replace_all("['´`^]","") %>% chartr('ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÅÃÕÑ','AEIOUAEIOUAEIOUAEIOUAAOX', .)
  return(Campo)
}


trataRS<-function(Campo){
  Campo<-Campo %>% chartr('ÇÆ£ØÞßÐ','XXXXXXXXX',.) %>%
    str_to_upper(locale = "es") %>% str_trim(side = "both") %>%
    str_replace_all("['´`^]","") %>% chartr('ÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÅÃÕ','AEIOUAEIOUAEIOUAEIOUAAO', .)
  return(Campo)
}

Then I applied these functions to my columns:

Base$paterno_originador<-trata(Base$paterno_originador)
Base$razon_originador <- trataRS(Base$razon_originador)

But I got this ERROR:

Error in chartr("ÇÆ£ØÞßÐ","XXXXXXXXX",.) : invalid input 'HÉCTOR' in 'utf8towcs'

So I tried a different way that I found here from @Alexandre_Lima:

rm_accent <- function(str,pattern="all") {
  if(!is.character(str))
    str <- as.character(str)
  
  pattern <- unique(pattern)
  
  if(any(pattern=="Ç"))
    pattern[pattern=="Ç"] <- "ç"
  
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )
  
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "AEIOUAEIOU",
    tilde = "AOAOXX",
    umlaut = "AEIOUAEIOUX",
    cedil = "XX"
  )
  
  accentTypes <- c("´","`","^","~","¨","ç")
  
  if(any(c("all","al","a","todos","t","to","tod","todo")%in%pattern)) # option: remove all accents
    return(chartr(paste(symbols, collapse=""), paste(nudeSymbols, collapse=""), str))
  
  for(i in which(accentTypes%in%pattern))
    str <- chartr(symbols[i],nudeSymbols[i], str) 
  
  return(str)
}

But I got a similar ERROR:

Error in chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""),  : 
  invalid input 'RUÍZ' in 'utf8towcs'

Here is the encoding of that column. Entries that contain a special character are marked as UTF-8:

Encoding(Base$nombre_originador)
[1] "unknown" "UTF-8" "unknown" "UTF-8"

1 Answer

The solution to the "invalid input in 'utf8towcs'" error is to set the encoding when importing the .csv file into R.

  1. When you import the file using read.csv() or read.delim(), specify encoding = "UTF-8" or encoding = "Latin-1". I tried "Latin-1" and it solved the problem. (R's help pages spell the Latin-1 value as "latin1", see ?Encoding, so that spelling may be worth trying too.)

  2. You might also want to check what your system encoding is and match it. You can query it with Sys.getlocale() and set it with Sys.setlocale(). On my system, for instance:

Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
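Before changing the import, it may help to confirm that the failing strings really are mislabelled. The following is only a sketch under assumptions: `nombre` is a made-up stand-in for a column such as Base$nombre_originador, and the "RUÍZ" entry is built by hand from Latin-1 bytes and then wrongly marked as UTF-8, which is one plausible way to end up with the 'utf8towcs' error. It uses only base R (validUTF8() needs R 3.3.0 or later).

```r
# Hypothetical stand-in for a column such as Base$nombre_originador.
ruiz <- rawToChar(as.raw(c(0x52, 0x55, 0xCD, 0x5A)))  # "RUÍZ" as Latin-1 bytes
Encoding(ruiz) <- "UTF-8"                             # wrong label, as after a bad import
nombre <- c("PEREZ", ruiz, "LOPEZ")

Sys.getlocale("LC_CTYPE")   # character-handling part of the locale
l10n_info()[["UTF-8"]]      # TRUE when the session runs in a UTF-8 locale

enc <- Encoding(nombre)     # declared encoding of each element
ok  <- validUTF8(nombre)    # FALSE where the bytes are not valid UTF-8
which(!ok)                  # positions of the entries that chartr() would reject
```

If which(!ok) returns any positions, the declared encoding and the actual bytes disagree for those entries, and the fix belongs at import time (or in a re-encoding step) rather than in the accent-stripping function.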

An example:

# Semicolon-separated text file
data <- read.delim("input/data/data.txt", sep = ";",
                   encoding = "Latin-1", stringsAsFactors = FALSE)

# Semicolon-separated .csv file
data <- read.csv("input/data/data.csv", sep = ";",
                 encoding = "Latin-1", stringsAsFactors = FALSE)
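If the data frame is already loaded and re-importing is inconvenient, the strings can also be repaired in place. This is a rough sketch, not a definitive fix: the helper names `mislabel` and `to_utf8` are made up for illustration, the small `Base` data frame is fabricated test data, and the code assumes the invalid entries are really Latin-1. If your file uses another encoding, change the `assumed` argument.

```r
# Hypothetical data: "HÉCTOR" and "RUÍZ" stored as Latin-1 bytes but
# labelled UTF-8, the state that makes chartr() fail.
mislabel <- function(bytes) {
  s <- rawToChar(as.raw(bytes))
  Encoding(s) <- "UTF-8"
  s
}
Base <- data.frame(
  nombre_originador  = c("ANA", mislabel(c(0x48, 0xC9, 0x43, 0x54, 0x4F, 0x52))),
  paterno_originador = c(mislabel(c(0x52, 0x55, 0xCD, 0x5A)), "PEREZ"),
  stringsAsFactors = FALSE
)

# Re-label invalid entries as `assumed` (an assumption about the true source
# encoding) and convert them to real UTF-8; valid entries are left untouched.
to_utf8 <- function(x, assumed = "latin1") {
  bad <- !is.na(x) & !validUTF8(x)
  fix <- x[bad]
  Encoding(fix) <- assumed   # change the label only, bytes stay the same
  x[bad] <- enc2utf8(fix)    # now convert the bytes to UTF-8
  x
}

chr_cols <- vapply(Base, is.character, logical(1))
Base[chr_cols] <- lapply(Base[chr_cols], to_utf8)

# \u escapes keep the script independent of the script file's own encoding.
old <- "\u00c1\u00c9\u00cd\u00d3\u00da"  # ÁÉÍÓÚ
new <- "AEIOU"
Base[chr_cols] <- lapply(Base[chr_cols], chartr, old = old, new = new)
```

Once every character column holds valid UTF-8, functions such as trata() or rm_accent() from the question should be able to run chartr() without the 'utf8towcs' error.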

Kindest Regards