3

I'm currently running the following code to strip accented characters from my data:

df <- gsub('Á|Ã', 'A', df)
df <- gsub('É|Ê', 'E', df)
df <- gsub('Í',   'I', df)
df <- gsub('Ó|Õ', 'O', df)
df <- gsub('Ú',   'U', df)
df <- gsub('Ç',   'C', df)

However, I would like to do it in just one line (using another function for it would be ok). How can I do this?

Waldir Leoncio
    The “real” way to do this doesn’t involve regular expressions but rather Unicode normalisation. However, I’m not sure how well supported Unicode library bindings (such as ICU) are in R, and thus how practically feasible a proper solution would be. – Konrad Rudolph Dec 04 '13 at 19:42

4 Answers

5

Try something like this

iconv(c('Á'), "utf8", "ASCII//TRANSLIT")

You can just add more elements to the c().

EDIT: the result is platform dependent; see help(iconv).

Here is a one-line solution for the characters in the question:

mychar <- c('ÁÃÉÊÍÓÕÚÇ')
iconv(mychar, "latin1", "ASCII//TRANSLIT") # one line, as requested
[1] "AAEEIOOUC"
Usobi
3

This is an encoding problem, which you would normally resolve by specifying the right encoding. If you still want to use regular expressions, gsubfn lets you write a one-line solution:

library(gsubfn)
ll <- list('Á'='A', 'Ã'='A', 'É'='E',
           'Ê'='E', 'Í'='I', 'Ó'='O',
           'Õ'='O', 'Ú'='U', 'Ç'='C')
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,'ÁÃÉÊÍÓÕÚÇ')
[1] "AAEEIOOUC"
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,c('ÁÃÉÊÍÓÕÚÇ','ÍÓÕÚÇ'))
[1] "AAEEIOOUC" "IOOUC"   
agstudy
1

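One option is chartr, which translates each character in old to the character at the same position in new. The example below maps uppercase letters to lowercase, but the same call works for accented characters: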

> toreplace <- LETTERS
> replacewith <- letters
> (somestring <- paste(sample(LETTERS,10),collapse=""))
[1] "MUXJVYNZQH"
> 
> chartr(
+   old=paste(toreplace,collapse=""),
+   new=paste(replacewith,collapse=""),
+   x=somestring
+   )
[1] "muxjvynzqh"
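
Applied to the characters from the question, the same chartr call could look like the sketch below. The names accented, plain, and strip_accents are hypothetical helpers introduced only for illustration. The accented characters are written as \u escapes (an assumption made here so the script does not depend on the encoding of the source file), with the literal characters shown in a comment.

```r
# Code points for the characters in the question: ÁÃÉÊÍÓÕÚÇ
accented <- "\u00c1\u00c3\u00c9\u00ca\u00cd\u00d3\u00d5\u00da\u00c7"
# Replacement characters, one per position in `accented`
plain <- "AAEEIOOUC"

# Hypothetical helper: translate every accented character in one call
strip_accents <- function(x) chartr(old = accented, new = plain, x = x)

# Vectorised over character vectors, like chartr itself
result <- strip_accents(c(accented, "\u00cd\u00d3\u00d5\u00da\u00c7"))
print(result)
```

Because chartr maps characters one to one, old and new must have the same number of characters; it cannot replace one character with several (for example ß with ss).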
TheComeOnMan
0
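df = as.data.frame(apply(df, 2, function(x) gsub('Á|Ã', 'A', x)))

The second argument of apply() is the margin: 2 applies the function over columns, 1 over rows. Note that the function must operate on its argument x (the current column), not on df itself.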
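
A column-wise variant can also be sketched with lapply, which leaves non-character columns untouched (apply coerces the whole data frame to a character matrix first). This is only a minimal sketch: the sample data frame and the name is_chr are made up for illustration, and the accented characters are again written as \u escapes so the example does not depend on the file encoding.

```r
# Hypothetical sample data: "ÁGUA" and "IRMÃ" plus a numeric column
df <- data.frame(
  name = c("\u00c1GUA", "IRM\u00c3"),
  n = 1:2,
  stringsAsFactors = FALSE
)

# Find the character columns so numeric ones keep their type
is_chr <- vapply(df, is.character, logical(1))

# Replace Á or Ã with A in every character column
df[is_chr] <- lapply(df[is_chr], function(col) gsub("[\u00c1\u00c3]", "A", col))

print(df)
```

Additional replacements can be chained inside the anonymous function, or the gsub call can be swapped for a single chartr or iconv call.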

BigDataScientist