Concatenate gsub

Question

I'm currently running the following code to strip accented characters from my data:

df <- gsub('Á|Ã', 'A', df)
df <- gsub('É|Ê', 'E', df)
df <- gsub('Í',   'I', df)
df <- gsub('Ó|Õ', 'O', df)
df <- gsub('Ú',   'U', df)
df <- gsub('Ç',   'C', df)

However, I would like to do it in just one line (using another function for it would be ok). How can I do this?

The “real” way to do this doesn’t involve regular expressions but rather Unicode normalisation. However, I’m not sure how well supported Unicode library bindings (such as ICU) are in R, and thus how practically feasible a proper solution would be. — Konrad Rudolph, Dec 04 '13 at 19:42
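As a rough sketch of the ICU route mentioned in the comment above: the stringi package wraps ICU, and its stri_trans_general() function accepts ICU transform identifiers such as "Latin-ASCII". This assumes stringi is installed. The vector `x` is made up for illustration and is written with \u escapes so the snippet does not depend on the file's encoding.

```r
# Hypothetical input: the accented capitals from the question, as \u escapes.
x <- c("\u00c1\u00c3\u00c9\u00ca\u00cd\u00d3\u00d5\u00da\u00c7",
       "\u00cd\u00d3\u00d5\u00da\u00c7")

# Only run when the stringi package (ICU bindings) is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  # "Latin-ASCII" is an ICU transliterator that maps accented Latin
  # letters to their closest ASCII equivalents.
  ascii <- stringi::stri_trans_general(x, "Latin-ASCII")
  print(ascii)
}
```

Unlike a hand-written pattern, this does not need a list of every accented letter up front, so it also covers characters that were not anticipated.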

Usobi · Accepted Answer · 2013-12-04T19:51:19.147

5

Try something like this

iconv(c('Á'), "utf8", "ASCII//TRANSLIT")

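You can add more elements to the vector built with c().

EDIT: the result is platform dependent, since it relies on the system's iconv implementation and on the input's actual encoding; see help(iconv).

Here is a one-line solution in R: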

mychar <- c('ÁÃÉÊÍÓÕÚÇ')
iconv(mychar, "latin1", "ASCII//TRANSLIT") # one line, as requested
[1] "AAEEIOOUC"
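Because the `from` argument has to match the real encoding of the input, one defensive variant is to convert everything to UTF-8 first and then check for NA, which is what iconv() returns for elements it cannot convert. The following is only a sketch: the vector name `nomes` and its contents are invented, and the exact transliterated text still depends on the platform's iconv.

```r
# Hypothetical input, written with \u escapes so the script stays ASCII.
nomes <- c("\u00c1GUA", "CORA\u00c7\u00c3O", "plain")

# Normalise to UTF-8 first, so the `from` encoding is known for certain.
nomes_utf8 <- enc2utf8(nomes)

# Transliterate to ASCII; unconvertible elements come back as NA.
ascii <- iconv(nomes_utf8, from = "UTF-8", to = "ASCII//TRANSLIT")

# Keep the original text wherever the conversion failed.
failed <- is.na(ascii)
ascii[failed] <- nomes[failed]
print(ascii)
```

iconv() is vectorised, so the same call works on a character vector of any length.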

edited Dec 04 '13 at 19:51

answered Dec 04 '13 at 19:43

Usobi


1

This returns `NA` for me. – Matthew Plourde Dec 04 '13 at 19:46
3

This should do it `mychar <- c('Á', 'Ã', 'É', 'Ê', 'Í', 'Ó', 'Õ', 'Ú','Ç')` `iconv(mychar, "latin1", "ASCII//TRANSLIT")` – Ben Dec 04 '13 at 19:47
Yeah, on Windows you should use something like "ISO-8859-1" instead of "utf8". – Waldir Leoncio Dec 04 '13 at 19:51
1

This is an interesting solution, but it only seems to work with single vectors. I actually have to read a names vector with 900 entries, and I can't just shove the whole vector into `iconv`. – Waldir Leoncio Dec 04 '13 at 19:55
2

@WaldirLeoncio sure you can: `Vectorize(iconv)(rep(mychar, 2), "latin1", "ASCII//TRANSLIT")` – Matthew Plourde Dec 04 '13 at 19:59
@WaldirLeoncio Update your question with a more accurate example of your real use-case. `sapply` might be part of the solution. – Ben Dec 04 '13 at 19:59
or even `paste(mychar,collapse=)` if you know a good separator. – Usobi Dec 04 '13 at 20:01
Never mind, halfway through creating a loop here I found out I had typed the wrong vector name. `iconv` works fine with vectors, real sorry. :$ – Waldir Leoncio Dec 04 '13 at 20:06

agstudy · Answer 2 · 2013-12-04T19:56:50.033

It's an encoding problem. Normally you resolve it by specifying the right encoding. If you still want to use regular expressions, you can use gsubfn to write a one-line solution:

library(gsubfn)
ll <- list('Á'='A', 'Ã'='A', 'É'='E',
           'Ê'='E', 'Í'='I', 'Ó'='O',
           'Õ'='O', 'Ú'='U', 'Ç'='C')
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,'ÁÃÉÊÍÓÕÚÇ')
[1] "AAEEIOOUC"
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,c('ÁÃÉÊÍÓÕÚÇ','ÍÓÕÚÇ'))
[1] "AAEEIOOUC" "IOOUC"

score 1 · Answer 3 · answered Dec 04 '13 at 19:39

One option could be chartr

> toreplace <- LETTERS
> replacewith <- letters
> (somestring <- paste(sample(LETTERS,10),collapse=""))
[1] "MUXJVYNZQH"
> 
> chartr(
+   old=paste(toreplace,collapse=""),
+   new=paste(replacewith,collapse=""),
+   x=somestring
+   )
[1] "muxjvynzqh"
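Applied to the characters from the question, the same chartr() idea might look like the sketch below. `old` holds Á, Ã, É, Ê, Í, Ó, Õ, Ú, Ç as \u escapes, so the snippet does not depend on the file's encoding, and `new` holds the replacement for each position. The variable names are illustrative only.

```r
# One character in `old` maps to the character at the same position in `new`.
old <- "\u00c1\u00c3\u00c9\u00ca\u00cd\u00d3\u00d5\u00da\u00c7"
new <- "AAEEIOOUC"

# Hypothetical input: two strings made of the accented capitals.
x <- c(old, "\u00cd\u00d3\u00d5\u00da\u00c7")

# chartr() is vectorised over x, so a single call handles the whole vector.
result <- chartr(old, new, x)
print(result)
# [1] "AAEEIOOUC" "IOOUC"
```

chartr() requires `old` and `new` to line up character by character, so it suits one-to-one replacements like these but not replacements that change the number of characters.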

score 0 · Answer 4 · answered Dec 04 '13 at 19:44

df <- as.data.frame(apply(df, 2, function(x) gsub('Á|Ã', 'A', x)))

In apply(), a MARGIN of 2 applies the function to each column; 1 would apply it to each row.
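To handle every substitution in one pass over a data frame, rather than one letter per call, one option is to combine a column-wise lapply() with chartr(). This is a sketch under assumptions: the data frame and its column names are invented, and the accented letters are written as \u escapes.

```r
# Hypothetical data frame with accented capitals in its character columns.
df <- data.frame(
  name = c("\u00c1GATA", "JO\u00c3O"),
  city = c("S\u00c3O PAULO", "BEL\u00c9M"),
  n    = c(1, 2),
  stringsAsFactors = FALSE
)

# Position-wise mapping: each character in `old` becomes the one in `new`.
old <- "\u00c1\u00c3\u00c9\u00ca\u00cd\u00d3\u00d5\u00da\u00c7"
new <- "AAEEIOOUC"

# df[] <- lapply(...) keeps the data frame structure and column types;
# only character columns are rewritten.
df[] <- lapply(df, function(col) {
  if (is.character(col)) chartr(old, new, col) else col
})
print(df)
```

Using lapply() instead of apply() avoids the intermediate matrix conversion, so numeric columns stay numeric.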

answered Dec 04 '13 at 19:44

BigDataScientist


This works for the "A" substitutions, but what about the rest? My point is: how can I do it without running the same command six times? – Waldir Leoncio Dec 04 '13 at 19:48
Please see agstudy's answer above, that works perfectly – BigDataScientist Dec 05 '13 at 15:25
