41

I have a data.table called base, with a term column in it:

class(base$term)
[1] character
length(base$term)
[1] 27486

I can remove accents from a single string, and from a vector of strings:

iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"

But for some reason, it does not work when I apply the very same function to my terme column:

base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"

Does anybody know what is going on here?

hans glick
  • Is it because `base$terme` is a factor? Try converting to `character` first or convert the levels maybe? – NJBurgo Aug 25 '16 at 15:07
  • @NJBurgo According to the first output (assuming a typo), it’s of type `character`. – Konrad Rudolph Aug 25 '16 at 15:08
  • Careful: I get a completely different result for your vectors: ``[1] "Mill'esime" "boulang`ere"`` The `iconv` documentation specifies that `TRANSLIT` gives different results on different systems (which is of course a bit useless). – Konrad Rudolph Aug 25 '16 at 15:08
  • Try `iconv(base$terme[2],from="latin1",to="ASCII//TRANSLIT")`. If it doesn't work, please give the output of `Encoding(base$terme[2])`. – nicola Aug 25 '16 at 15:15
  • Hi Nicola, the output of Encoding(base$terme[2]) is "UTF-8" – hans glick Aug 25 '16 at 15:38

8 Answers

48

It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi is consistent across operating systems, whereas iconv is not.

library(data.table)
library(stringi)
 
base <- data.table(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base[, terme := stri_trans_general(str = terme, 
                                   id = "Latin-ASCII")]
 
> base
           terme
1:     Millesime
2:    boulangere
3: ueaaaaceeeiii
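
The same call also works on a plain character vector, outside of data.table. The following is a minimal sketch, assuming stringi is installed; the sample strings are written with `\u` escapes so that they do not depend on the encoding of the script file, and the final check with `stri_enc_isascii()` is an addition for illustration, not part of the answer above.

```r
library(stringi)

# Sample strings, written with \u escapes so the script encoding does not matter
x <- c("Mill\u00e9sime",                                   # Millésime
       "boulang\u00e8re",                                  # boulangère
       "\u00fc\u00e9\u00e2\u00e4\u00e0\u00e5\u00e7\u00ea\u00eb\u00e8\u00ef\u00ee\u00ec")  # üéâäàåçêëèïîì

# Transliterate Latin characters to their closest ASCII equivalents
out <- stri_trans_general(str = x, id = "Latin-ASCII")
print(out)

# Confirm that nothing non-ASCII is left
print(all(stri_enc_isascii(out)))
```

Checking the result this way makes it easy to spot characters that the chosen transliterator did not cover.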
Robert Hacken
Jeldrik
  • This is the only method which worked for me, thanks a lot! – Janus De Bondt Mar 17 '20 at 07:30
  • This is the only method that worked for me as well. However, if someone could fix the following, that would be great: `ds_itens$content <- iconv(ds_itens$content, 'utf-8', 'ascii', sub = '')`. This last code almost works 100%. – Luis Feb 24 '22 at 15:01
36

OK, here is the way to solve the problem: check the encoding of the string and pass it explicitly to iconv via the from argument.

Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"

Thanks to @nicola
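
A likely reason the original call failed is that, when `from` is omitted, `iconv` assumes the current locale's encoding, which may not match the UTF-8 marking of the data. The sketch below is one hedged way to make the conversion explicit: `enc2utf8()` is used here as an extra safety step (it is not part of the fix above), and the sample vector `x` is made up for illustration. The exact transliteration still depends on the platform, so no output is shown.

```r
# Sample strings, written with \u escapes so they are always marked as UTF-8
x <- c("Mill\u00e9sime", "boulang\u00e8re")

# Inspect the declared encoding of each element
print(Encoding(x))

# Normalise to UTF-8 first, then state the source encoding explicitly
x_utf8 <- enc2utf8(x)
ascii  <- iconv(x_utf8, from = "UTF-8", to = "ASCII//TRANSLIT")

# iconv returns NA for elements it could not convert, so flag those
failed <- is.na(ascii)
print(ascii)
print(which(failed))
```

If `failed` is not empty, the affected elements need another approach, since `TRANSLIT` support differs between systems.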

hans glick
12

Here is a version of Jeldrik's solution revised for data frames. Note that the := operator comes from data.table and is not available for a plain data.frame, so regular assignment is used instead.

library(stringi)

base <- data.frame(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base$terme <- stri_trans_general(str = base$terme, id = "Latin-ASCII")
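
If the data frame has several text columns, the same idea can be applied to all character columns at once while leaving numeric columns untouched. This is a minimal sketch in base R plus stringi; the data frame `df`, its column `n`, and the helper vector `is_chr` are hypothetical names introduced only for this example.

```r
library(stringi)

# Hypothetical data frame with one text column and one integer column
df <- data.frame(
  terme = c("Mill\u00e9sime", "boulang\u00e8re"),  # Millésime, boulangère
  n     = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Find the character columns
is_chr <- vapply(df, is.character, logical(1))

# Transliterate only those columns; other columns keep their type
df[is_chr] <- lapply(df[is_chr], stri_trans_general, id = "Latin-ASCII")

print(df)
```

Restricting the call to character columns avoids silently coercing numeric or date columns to text.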
jf2017
3

You can apply this function:

rm_accent <- function(str, pattern = "all") {
  # Coerce factors and other types to character first
  if (!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  # Accept an uppercase cedilla as a synonym for "ç"
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"

  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circumflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  # Unaccented replacements, position by position
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circumflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  # Accent selectors, in the same order as the groups above
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")

  # Option: remove all accents
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))

  # Otherwise remove only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)

  return(str)
}
Alexandre Lima
3

Three ways to remove accents - shown and compared to each other below.
The data to play with:

library(data.table)
dtCases <- fread("https://raw.githubusercontent.com/ccodwg/Covid19Canada/master/retired_datasets/individual_level/cases_2021_1.csv", stringsAsFactors = FALSE)
dim(dtCases) #  751526     16

Benchmarking:

> system.time(dtCases [, city0 := health_region])
   user  system elapsed 
  0.009   0.001   0.012 
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
   user  system elapsed 
  0.165   0.001   0.200 
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
   user  system elapsed 
  9.108   0.063   9.351 
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
   user  system elapsed 
   4.34    0.00    4.46 

Result:

> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord

Conclusion:

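In this benchmark, base::iconv() is the fastest of the three (about 0.2 s elapsed, versus about 4.5 s for stringi and 9.4 s for textclean) and is my preferred method. All three gave identical results on these French names; other languages were not tested.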

zx8754
IVIM
2

Based on jf2017's code, here is a tidyverse solution:

library(dplyr)
base %>%
   # accent-free "term" column (careful: this overwrites the original "term" column!)
   mutate(term = stringi::stri_trans_general(str = term, id = "Latin-ASCII"))

To apply to all columns in your data frame, use

base %>%
mutate(across(.cols = everything(),
              .fns = ~ stringi::stri_trans_general(., id = "Latin-ASCII")))

Run stringi::stri_trans_list() to see all the transliterator identifiers that id can take.

Oscar Montoya
2

Quick, simple, easy to modify, no dependencies:

# -----------------------------------------------------------------------------
# Removes common accents from letters.
#
# @param s The string to remove diacritics from.
# -----------------------------------------------------------------------------
accentless <- function( s ) {
  chartr(
    "áéóūáéíóúÁÉÍÓÚýÝàèìòùÀÈÌÒÙâêîôûÂÊÎÔÛãõÃÕñÑäëïöüÄËÏÖÜÿçÇ",
    "aeouaeiouAEIOUyYaeiouAEIOUaeiouAEIOUaoAOnNaeiouAEIOUycC",
    s )
}
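
One thing to keep in mind when modifying a `chartr`-based function: `chartr` maps characters one to one by position, so the two strings must line up exactly, and a character that expands to two letters (such as the ligature œ) cannot be handled by `chartr` alone. The sketch below illustrates both points under stated assumptions; the short mapping and the `gsub` step for the ligature are additions for illustration and are not part of the function above.

```r
# A short one-to-one mapping: each character in `old` maps to the
# character at the same position in `new`
old <- "\u00e9\u00e8\u00ea\u00e7"   # é è ê ç
new <- "eeec"

# Guard against a misaligned mapping before using it
stopifnot(nchar(old) == nchar(new))

x <- c("Mill\u00e9sime", "boulang\u00e8re", "c\u0153ur")  # Millésime, boulangère, cœur

# Single-character replacements
out <- chartr(old, new, x)

# Multi-character replacement, which chartr cannot express
out <- gsub("\u0153", "oe", out, fixed = TRUE)

print(out)
```

The length check is cheap and catches the most common mistake when editing the two strings by hand, which is adding a character to one side only.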
Dave Jarvis
1

The iconv option didn't work on my computer, but I solved this issue using the replace_non_ascii function from the textclean package.