Remove accents from a dataframe column in R

Question

I have a data.table named `base` with a `term` column:

class(base$term)
[1] character
length(base$term)
[1] 27486

I can remove accents from a single string and from a vector of strings:

iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"

But for some reason, it does not work when I apply the very same function to my term column:

base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"

Does anybody know what is going on here?

Is it because `base$terme` is a factor? Try converting to `character` first or convert the levels maybe? — NJBurgo, Aug 25 '16 at 15:07
@NJBurgo According to the first output (assuming a typo), it’s of type `character`. — Konrad Rudolph, Aug 25 '16 at 15:08
Careful: I get a completely different result for your vectors: ``[1] "Mill'esime" "boulang`ere"`` The `iconv` documentation specifies that `TRANSLIT` gives different results on different systems (which is of course a bit useless). — Konrad Rudolph, Aug 25 '16 at 15:08
Try `iconv(base$terme[2],from="latin1",to="ASCII//TRANSLIT")`. If it doesn't work, please give the output of `Encoding(base$terme[2])`. — nicola, Aug 25 '16 at 15:15

score 48 · Answer 1 · edited Nov 21 '22 at 16:00

It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi behaves consistently across operating systems, whereas iconv does not.

library(stringi)
 
base <- data.table(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base[, terme := stri_trans_general(str = terme, 
                                   id = "Latin-ASCII")]
 
> base
           terme
1:     Millesime
2:    boulangere
3: ueaaaaceeeiii
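
A minimal variation on the code above, assuming you want to keep the accented originals: write the result to a new column instead of overwriting `terme`. The column name `terme_ascii` and the two-row sample table are hypothetical, chosen only for illustration.

```r
library(data.table)
library(stringi)

# Hypothetical sample data; \u escapes keep this snippet independent
# of the encoding the script file is saved in.
base <- data.table(terme = c("Mill\u00e9sime", "boulang\u00e8re"))

# Store the transliterated text in a new column (name is an assumption),
# leaving the original accented strings untouched.
base[, terme_ascii := stri_trans_general(terme, id = "Latin-ASCII")]

print(base)
```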

answered Jun 14 '19 at 09:20 by Jeldrik · edited Nov 21 '22 at 16:00 by Robert Hacken

This is the only method which worked for me, thanks a lot! – Janus De Bondt Mar 17 '20 at 07:30
This is the only method that worked for me as well. However, if someone would like to fix the following one => `ds_itens$content <- iconv(ds_itens$content, 'utf-8', 'ascii', sub = '')` would be great. This last code is almost 100% – Luis Feb 24 '22 at 15:01

score 36 · Answer 2 · answered Aug 26 '16 at 07:57

OK, here is how to solve the problem:

Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"

Thanks to @nicola
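
A likely explanation, offered as an assumption rather than something confirmed in this thread: when `from` is left at its default, `iconv` interprets the bytes in the session's native encoding, so on a non-UTF-8 system the two UTF-8 bytes of "é" can be misread as two separate characters. The sketch below applies the same fix to a whole vector; `x` and `x_ascii` are hypothetical names, and the exact `TRANSLIT` output can differ between platforms, as noted in the comments on the question.

```r
# Hypothetical sample vector; strings written with \u escapes are
# always marked as UTF-8 by R.
x <- c("Mill\u00e9sime", "boulang\u00e8re")

# Inspect the declared encoding before converting.
print(Encoding(x))

# enc2utf8() re-encodes any latin1- or native-marked elements to UTF-8,
# so the explicit `from` below is valid for every element.
x_ascii <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")

print(x_ascii)
```

For the data.table in the question, the same call can be used inside `base[, terme := ...]`.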

answered Aug 26 '16 at 07:57 by hans glick

score 12 · Answer 3 · answered Aug 13 '21 at 22:30

Here is a version of Jeldrik's solution revised for data frames. Note that the := operator belongs to data.table and does not work on a plain data.frame, so ordinary assignment is used instead.

library(stringi)

base <- data.frame(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base$terme = stri_trans_general(str = base$terme, id = "Latin-ASCII")

score 3 · Answer 4 · answered Feb 18 '19 at 16:18

You can apply this function

rm_accent <- function(str, pattern = "all") {
  if(!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  if(any(pattern=="Ç"))
    pattern[pattern=="Ç"] <- "ç"

  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  accentTypes <- c("´","`","^","~","¨","ç")

  if(any(c("all","al","a","todos","t","to","tod","todo")%in%pattern)) # option: remove all accents
    return(chartr(paste(symbols, collapse=""), paste(nudeSymbols, collapse=""), str))

  for(i in which(accentTypes%in%pattern))
    str <- chartr(symbols[i],nudeSymbols[i], str) 

  return(str)
}

Unlike `iconv` this isn't vectorized and doesn't cover all possibilities. — Leonardo Siqueira, Apr 10 '19 at 19:29

score 3 · Answer 5 · edited Sep 02 '21 at 09:38

Three ways to remove accents - shown and compared to each other below.
The data to play with:

library(data.table)
library(magrittr)  # provides %>% used in the result section
dtCases <- fread("https://raw.githubusercontent.com/ccodwg/Covid19Canada/master/retired_datasets/individual_level/cases_2021_1.csv", stringsAsFactors = FALSE)
dim(dtCases) #  751526     16

Benchmarking:

> system.time(dtCases [, city0 := health_region])
   user  system elapsed 
  0.009   0.001   0.012 
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
   user  system elapsed 
  0.165   0.001   0.200 
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
   user  system elapsed 
  9.108   0.063   9.351 
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
   user  system elapsed 
   4.34    0.00    4.46

Result:

> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord

Conclusion:

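In this benchmark, base::iconv() is by far the fastest (about 0.2 s elapsed, versus about 4.5 s for stringi and 9.4 s for textclean), and all three give identical results on this data, so it is the method I prefer. Tested on French words only, not on other languages.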

Oscar Montoya · Answer 6 · 2022-10-12T19:18:09.877

Based on jf2017's code, here is a tidyverse solution:

library(dplyr)
base %>%
   # accent-free "term" column (careful: this overwrites the original "term" column!)
   mutate(term = stringi::stri_trans_general(str = term, id = "Latin-ASCII"))

To apply to all columns in your data frame, use

base %>%
mutate(across(.cols = everything(),
              .fns = ~ stringi::stri_trans_general(., id = "Latin-ASCII")))

Run stringi::stri_trans_list() to see all available values that id can take.
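
One caveat with `everything()`: it also passes numeric columns to `stri_trans_general`, which I would expect to come back as character. A possible variation, sketched here on a hypothetical two-column data frame (`df`, `df_ascii`, and the column `n` are illustration names only), restricts the transformation to character columns with `where(is.character)`.

```r
library(dplyr)

# Hypothetical sample data with one text column and one integer column.
df <- data.frame(term = c("Mill\u00e9sime", "boulang\u00e8re"),
                 n = c(1L, 2L),
                 stringsAsFactors = FALSE)

# Only character columns are transliterated; `n` is left as it is.
df_ascii <- df %>%
  mutate(across(where(is.character),
                ~ stringi::stri_trans_general(.x, id = "Latin-ASCII")))

str(df_ascii)
```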

score 2 · Answer 7 · answered Mar 25 '23 at 19:00

Quick, simple, easy to modify, no dependencies:

# -----------------------------------------------------------------------------
# Removes common accents from letters.
#
# @param s The string to remove diacritics from.
# -----------------------------------------------------------------------------
accentless <- function( s ) {
  chartr(
    "áéóūáéíóúÁÉÍÓÚýÝàèìòùÀÈÌÒÙâêîôûÂÊÎÔÛãõÃÕñÑäëïöüÄËÏÖÜÿçÇ",
    "aeouaeiouAEIOUyYaeiouAEIOUaeiouAEIOUaoAOnNaeiouAEIOUycC",
    s );
}
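
As a usage sketch for this kind of translation table: `chartr` is vectorized over its third argument, so it can be applied directly to a whole column. The example below uses a deliberately tiny table and a hypothetical data frame; `old_chars`, `new_chars`, and `terme_ascii` are names made up for illustration.

```r
# Hypothetical sample data; \u escapes avoid depending on file encoding.
base <- data.frame(terme = c("Mill\u00e9sime", "boulang\u00e8re"),
                   stringsAsFactors = FALSE)

# A deliberately small translation table: e-acute and e-grave both map to "e".
# Both strings must have the same number of characters.
old_chars <- "\u00e9\u00e8"
new_chars <- "ee"

# chartr() translates every element of the column in one call.
base$terme_ascii <- chartr(old_chars, new_chars, base$terme)

print(base$terme_ascii)
```

Any accented character missing from the table passes through unchanged, which is the main trade-off against `stri_trans_general`.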

score 1 · Answer 8 · answered Apr 11 '23 at 21:42

The iconv option didn't work on my computer, but I solved this issue using the replace_non_ascii function from the textclean package.

answered Apr 11 '23 at 21:42 by andrés castro araújo
