13

I'm dealing with a large amount of data, mostly names with non-English characters. My goal is to match these names against some information on them collected in the USA.

For example, I might want to match the name 'Sølvsten' (from some list of names) to 'Soelvsten' (the name as stored in some American database). Here is a function I wrote to do this. It is clearly clunky and somewhat arbitrary, so I wonder whether there is a simple R function that translates these foreign characters to their nearest English neighbours. I understand that there might not be a standard way to do this conversion; I'm just curious whether one exists and whether it can be done through an R function.

# a function to replace foreign characters
replaceforeignchars <- function(x)
{
    # gsub() is part of base R, so no extra package needs to be loaded
    x <- gsub("š","s",x)
    x <- gsub("œ","oe",x)
    x <- gsub("ž","z",x)
    x <- gsub("ß","ss",x)
    x <- gsub("þ","y",x)
    x <- gsub("à","a",x)
    x <- gsub("á","a",x)
    x <- gsub("â","a",x)
    x <- gsub("ã","a",x)
    x <- gsub("ä","a",x)
    x <- gsub("å","a",x)
    x <- gsub("æ","ae",x)
    x <- gsub("ç","c",x)
    x <- gsub("è","e",x)
    x <- gsub("é","e",x)
    x <- gsub("ê","e",x)
    x <- gsub("ë","e",x)
    x <- gsub("ì","i",x)
    x <- gsub("í","i",x)
    x <- gsub("î","i",x)
    x <- gsub("ï","i",x)
    x <- gsub("ð","d",x)
    x <- gsub("ñ","n",x)
    x <- gsub("ò","o",x)
    x <- gsub("ó","o",x)
    x <- gsub("ô","o",x)
    x <- gsub("õ","o",x)
    x <- gsub("ö","o",x)
    x <- gsub("ø","oe",x)
    x <- gsub("ù","u",x)
    x <- gsub("ú","u",x)
    x <- gsub("û","u",x)
    x <- gsub("ü","u",x)
    x <- gsub("ý","y",x)
    x <- gsub("ÿ","y",x)
    x <- gsub("ğ","g",x)

    return(x)
}

Note: I know there are name-matching algorithms such as Jaro-Winkler distance matching, but I'd rather do exact matches.

krishnan

6 Answers

21

Try using the chartr R function for the one-character substitutions (which should be quite fast), then clean up with a series of gsub calls for the one-to-two-character substitutions (presumably slower, but there are not many of them).

to.plain <- function(s) {

   # 1 character substitutions
   old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
   new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
   s1 <- chartr(old1, new1, s)

   # 2 character substitutions
   old2 <- c("œ", "ß", "æ", "ø")
   new2 <- c("oe", "ss", "ae", "oe")
   s2 <- s1
   for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)

   s2
}

Add to old1, new1, old2 and new2 as needed.

Here is a test:

> s <- "æxš"
> to.plain(s)
[1] "aexs"

UPDATE: corrected variable names in chartr.
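
When adding to old1 and new1, keep in mind that chartr() pairs characters by position, so the two strings have to stay aligned. A minimal sketch of a guard for that, using a deliberately shortened map and \u escapes (both are illustration choices, not part of the answer above):

```r
# Sketch only: a shortened map to show the alignment check.
# "\u0161" is š and "\u00e7" is ç; the escapes keep the script
# independent of the encoding of the source file.
old1 <- "\u0161\u00e7"
new1 <- "sc"

# chartr() maps the i-th character of old1 to the i-th character of new1,
# so both strings must contain the same number of characters.
stopifnot(nchar(old1) == nchar(new1))

plain <- chartr(old1, new1, c("\u0161ik", "gar\u00e7on"))
plain
```

Running the check once after every edit to the maps is cheap and catches a dropped or duplicated character before it silently shifts the rest of the mapping.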

G. Grothendieck
  • Thanks, Gabor (I am assuming you are the same as http://r.789695.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=39147). I tested all three solutions posted so far and this looks like the quickest (albeit, I just observed execution time and didn't actually **time**, _and_ it was on a laptop that isn't plugged in so who knows what's driving efficiency :-)) – krishnan Jul 08 '13 at 00:48
  • shouldn't it be `s1 <- chartr(old1,new1,s)`? – Lucarno Nov 29 '13 at 19:47
  • Mon œil, I run into an encoding issue here. On Windows, using base::chartr for single characters works, but looping for multi-character replacement of ligatures does not work on vectors containing the UTF-8 content "œ". (The rest of the ligatures work fine.) My workaround (*cough* iconv(s, "UTF-8", "latin1") *cough*) produces an artifact: "œ" is converted to "o" (by base::iconv), instead of "oe" by the loop. I guess this is caused by its omission from ISO-8859-1, but I cannot find a solution. Any ideas? – aae May 10 '16 at 09:17
  • Solved by using stringi::stri_trans_general("œ", "Latin-ASCII"), which did what iconv() and gsub() couldn't. – aae May 10 '16 at 19:56
  • This is great but I tried it on a Kaggle corpus of scientific papers and it seems to break on whatever is being interpreted as ",â€" just FYI `Error in chartr(old1, new1, s) : invalid input 'Using this “cancer pathway approach,†TSGs regulating cell signaling, ` – Hack-R Jul 15 '17 at 15:24
12

Edit for a potentially better result...

This might not work for all cases, but iconv might be worth investigating. From ?iconv:

Description:

 This uses system facilities to convert a character vector between
 encodings: the ‘i’ stands for ‘internationalization’.

Example:

test <- c("Sølvsten", "Günther")
iconv(test, "latin1", "ASCII//TRANSLIT")
#[1] "Solvsten" "Gunther" 

This isn't hugely simplified, but I think there is something to be said for separating the data from the code. This then is very similar to this question:

R: replace characters using gsub, how to create a function?

Define the from and to:

fromto <- read.table(text="
from to
š s
œ oe
ž z
ß ss
þ y
à a
á a
â a
ã a
ä a
å a
æ ae
ç c
è e
é e
ê e
ë e
ì i
í i
î i
ï i
ð d
ñ n
ò o
ó o
ô o
õ o
ö o
ø oe
ù u
ú u
û u
ü u
ý y
ÿ y
ğ g",header=TRUE)

Then the function:

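replaceforeignchars <- function(dat, fromto) {
  # apply each from -> to pair in turn; fixed = TRUE treats the pattern literally
  for (i in seq_len(nrow(fromto))) {
    dat <- gsub(fromto$from[i], fromto$to[i], dat, fixed = TRUE)
  }
  dat
}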

test <- c("Sølvsten", "Günther")
replaceforeignchars(test,fromto)
#[1] "Soelvsten" "Gunther"
thelatemail
8

You can install the uni2ascii C program and call it from R.

uni2ascii <- function(string) {
    # shQuote() protects names containing spaces, quotes or shell metacharacters
    cmd <- sprintf("echo %s | uni2ascii -B", shQuote(string))
    system(cmd, intern = TRUE, ignore.stderr = TRUE)
}

uni2ascii <- Vectorize(uni2ascii, USE.NAMES = FALSE)

uni2ascii(c("Sølvsten", "ğ", "œ"))
## [1] "Solvsten" "g"        "oe"
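
Because this approach depends on an external program, it may be worth checking that the program is actually installed before calling the wrapper. A small sketch using base R's Sys.which(); the helper name uni2ascii_available is hypothetical and not part of the answer:

```r
# Hypothetical guard, not part of the answer: TRUE if an executable
# called "uni2ascii" can be found on the PATH.
uni2ascii_available <- function() {
  unname(nzchar(Sys.which("uni2ascii")))
}

if (!uni2ascii_available()) {
  message("uni2ascii was not found on the PATH; install it before calling the wrapper.")
}
```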
dickoa
4

In the meantime, you can also use stri_trans_general() from the stringi package.

library(stringi)

x <- c("š", "ž", "ğ", "ß", "þ", "à", "á", "â", "ã", "ä", "å", "æ", 
       "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", 
       "ó", "ô", "õ", "ö", "ø", "œ", "ù", "ú", "û", "ü", "ý", "ÿ")
y <- stri_trans_general(x, "Latin-ASCII")

data.frame(x, y, stringsAsFactors = FALSE)
#>    x  y
#> 1  š  s
#> 2  ž  z
#> 3  ğ  g
#> 4  ß ss
#> 5  þ th
#> 6  à  a
#> 7  á  a
#> 8  â  a
#> 9  ã  a
#> 10 ä  a
#> 11 å  a
#> 12 æ ae
#> 13 ç  c
#> 14 è  e
#> 15 é  e
#> 16 ê  e
#> 17 ë  e
#> 18 ì  i
#> 19 í  i
#> 20 î  i
#> 21 ï  i
#> 22 ð  d
#> 23 ñ  n
#> 24 ò  o
#> 25 ó  o
#> 26 ô  o
#> 27 õ  o
#> 28 ö  o
#> 29 ø  o
#> 30 œ oe
#> 31 ù  u
#> 32 ú  u
#> 33 û  u
#> 34 ü  u
#> 35 ý  y
#> 36 ÿ  y

Note that this converts “ø” to “o”, however.

stri_trans_general("Sølvsten", "Latin-ASCII")
#> [1] "Solvsten"
dpprdan
1

Extending the answer of thelatemail: the original replaceforeignchars function contains a loop, which can consume resources for large texts. Here's an apply-based function that does the same thing without an explicit loop. As it stands, it works for a single string only (i.e., not for a vector of strings).

replaceforeignchars <- function(dat,fromto) {
   paste0(apply(matrix(unlist(strsplit(dat,""))),1,FUN=function(x) {ifelse(x %in% fromto$from, as.character( fromto[fromto$from==x, 'to']),  x)}), collapse="") 
} 
test <- c("Sølvsten")
replaceforeignchars(test,fromto)
[1] "Soelvsten"
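
To lift the single-string restriction, the same per-character lookup can be applied to every element of a vector. This is only a sketch under assumptions: replace_chars_vec is a hypothetical name, and fromto_small is a two-row stand-in for the fromto table defined in thelatemail's answer.

```r
# Sketch only: a vectorised variant of the per-character lookup.
replace_chars_vec <- function(dat, fromto) {
  from <- as.character(fromto$from)
  to   <- as.character(fromto$to)
  out <- vapply(strsplit(dat, ""), function(chars) {
    idx <- match(chars, from)       # position of each character in the map
    hit <- !is.na(idx)
    chars[hit] <- to[idx[hit]]      # swap mapped characters, keep the rest
    paste0(chars, collapse = "")
  }, character(1))
  out[is.na(dat)] <- NA_character_  # keep missing names missing
  out
}

# Two-row stand-in for the fromto table; "\u00f8" is ø and "\u00fc" is ü.
fromto_small <- data.frame(from = c("\u00f8", "\u00fc"),
                           to   = c("oe", "u"),
                           stringsAsFactors = FALSE)

res <- replace_chars_vec(c("S\u00f8lvsten", "G\u00fcnther", NA), fromto_small)
res
```

Using match() once per string avoids subsetting the data frame for every character, which is where the apply version spends most of its time.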
Thanos
1

Extending a little on the answer from dpprdan, and also using stringi::stri_trans_general: you can define custom rules or deviations for the transliteration. In my experience, using "Latin-ASCII" within stri_trans_general gives me the expected transliteration 9 out of 10 times.

In my case, I want the letter ø to be transliterated to oe and the letter å to be aa. The normal behaviour of "Latin-ASCII" would return o and a respectively.

## Define custom rules for å and ø, otherwise transliterate according to Latin-ASCII
custom_rules <- "å > aa;
                 ø > oe;
                 ::Latin-ASCII;"

stringi::stri_trans_general(c("Tårnby", "Søborg"), id = custom_rules, rules = TRUE)
[1] "Taarnby" "Soeborg"

It is case sensitive, so you need to define the upper-case letters as well if they appear in your data. Personally, I have just converted all my text to lower case. I have made a list from the transliterated data with the letters before and after transliteration to keep track of any unexpected behaviour.

So far, I have had 33 "odd" letters transliterated in my data and ø and å were the only two letters, where I wanted a deviating rule.
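
The before/after list mentioned above could be built along the following lines. This is a sketch under assumptions: audit_translit and toy_translit are hypothetical names, and toy_translit is a base R stand-in so the example runs without stringi. In practice one would pass a function that wraps stri_trans_general() instead.

```r
# Sketch only: tabulate every distinct non-ASCII character in the data
# next to what a given transliteration function turns it into.
audit_translit <- function(x, translit) {
  chars <- unique(unlist(strsplit(x[!is.na(x)], "")))
  # keep only characters whose code point lies outside the ASCII range
  is_non_ascii <- vapply(chars, function(ch) utf8ToInt(ch) > 127L, logical(1))
  non_ascii <- chars[is_non_ascii]
  data.frame(before = non_ascii,
             after  = vapply(non_ascii, translit, character(1), USE.NAMES = FALSE),
             stringsAsFactors = FALSE)
}

# Stand-in transliteration: "\u00e5" is å and "\u00f8" is ø.
toy_translit <- function(s) chartr("\u00e5\u00f8", "ao", s)

audit <- audit_translit(c("T\u00e5rnby", "S\u00f8borg"), toy_translit)
audit
```

Scanning the resulting table is a quick way to spot letters, such as ø and å here, where the default rule is not the one wanted.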

More information about custom rules can be found here. If you want the transliteration to ignore a letter in the process, I have an example of that here.

Thranholm