8

I am wanting to use MetaPhone, Double Metaphone, Caverphone, MetaPhone3, SoundEx, and if anyone has done it yet NameX functions within 'R' so I can categorize and summarize like values to minimize data cleansing operations prior to analysis.

I am fully aware that each algorithm has its own strengths and weakness and would highly prefer not to use SoundEx but it still might work if I cannot find alternatives; as like mentioned in this post Harper would match with any of a list of unrelated names under SoundEx but should not in Metaphone for better result matching.

Though I am not sure which would serve my purposes best while still preserving some flexibility so that is the reason I want to take a stab with several of them as well as before looking at the values generate a table like the following.

enter image description here

Table Source Link

Surnames are not the subject of my initial analysis but think it is a good example as I want to effectively consider all like 'sounding' words treated as the same value is really what I am trying to do with a simply call something as values are evaluated.

Some things I have already looked at:

  • I know that a C package could be written and called with RCpp, and there are even C solutions for SoundEx on SE, but I have not written an R package before and looking to avoid re-inventing the wheel if there is a simpler way to do it directly in R or a package exists that has the function available?
  • I am aware that the RecordLinkage and now stringdist package have a SoundEx function, but not any form of a MetaPhone function.

So I am specifically looking for an answer is to how do a MetaPhone / Caverphone function in R and know the "Value" so I can group data values by them?

The additional caveat is I am still consider my self pretty new to R as I am not a daily user of it.

Community
  • 1
  • 1
CRSouser
  • 658
  • 9
  • 25

4 Answers4

10

The algorithm is pretty straightforward but I, too, could not find an existing R package. If you really need to do this work in R, one short-term option is to install the python module metaphone (pip install metaphone) then use the rPython bridge to use it in R:

library(rPython)

python.exec("from metaphone import doublemetaphone")
python.call("doublemetaphone", "architect")
[1] "ARKTKT" ""

It's not the most elegant solution, but it gets you metaphone operations in R.

The Apache Commons has a codec library that also implements the metaphone algorithms:

library(rJava)

.jinit() # need to have commons-codec-1.10.jar in your CLASSPATH

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
.jcall(mp,"S","metaphone", "architect")
[1] "ARXT"

You can make the above .jcall an R function and use it like any other R function:

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

sapply(c("abridgement", "stupendous"), metaphone)

## abridgement  stupendous 
##      "ABRJ"      "STPN"

The java interface may be more compatible across platforms, too.

Here's a more complete view of using the java interface:

library(rJava)

.jinit()

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

double_metaphone <- function(x) {
  .jcall(dmp,"S","doubleMetaphone", x)  
}

words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')

data.frame(metaphone=sapply(words, metaphone),
           double=sapply(words, double_metaphone))

##           metaphone double
## Catherine      K0RN   K0RN
## Katherine      K0RN   K0RN
## Katarina       KTRN   KTRN
## Johnathan      JN0N   JN0N
## Jonathan       JN0N   JN0N
## John             JN     JN
## Teresa          TRS    TRS
## Theresa         0RS    0RS
## Smith           SM0    SM0
## Smyth           SM0    SM0
## Jessica         JSK    JSK
## Joshua           JX     JX
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
10

There is now an implementation of Double Metaphone in R in the package PGRdup.

install.packages(PGRdup)
library(PGRdup)
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')
DoubleMetaphone(words)

$primary
 [1] "K0RN" "K0RN" "KTRN" "JN0N" "JN0N" "JN"   "TRS"  "0RS"  "SM0"  "SM0"  "JSK"  "JX"  

$alternate
 [1] "KTRN" "KTRN" "KTRN" "ANTN" "ANTN" "AN"   "TRS"  "TRS"  "XMT"  "XMT"  "ASK"  "AX"  
Crops
  • 5,024
  • 5
  • 38
  • 65
3

I have been working on a package for this, called phonics, for a few months. I have implemented several of the common and less common, including Caverphone, Caverphone2, Metaphone, and soundex. A few others are implemented. I still have a handful more I plan to implement before calling it 1.0, but I just submitted a release of the package to CRAN.

James Howard
  • 1,258
  • 2
  • 11
  • 26
  • James please also include double metaphone with maxcodelength as argument, in your awesome library – AnilGoyal Aug 18 '23 at 04:58
  • 1
    @AnilGoyal, You know, there was a reason I didn't include double metaphone originally and frankly, I don't remember what that was! I will look into it. – James Howard Aug 25 '23 at 00:02
  • Yes, please look into it. Actually, I do not know C++/C that's why I am unable to implement DoubleMetaphone algorithm in R. Nevertheless, your library is awesome and will become full-fledged if you can implement this algo in it. @James Howard – AnilGoyal Aug 25 '23 at 02:01
0

Here is a caverphone interpretation, while it captures the cascading rules approach, keep in mind that caverphone was always intended as an example of customising to a region accent context (though people do use it in a general purpose kind of way as it makes a different set of tradeoffs to most of the others based on their own regions), so I would suggest a) getting the unique characters in your data source to make sure you are processing them all, b) consider changing the final length limitation in relation to the names you are working with, and c) have a think about modelling the regional accent mix- this was for modelling the various accent groups in New Zealand in the late 1800s/ early 1900s and the way they might mis-transcribe what each other was saying.

caverphonise <- function(x) {
# Convert to lowercase
x <- tolower(x)

# Remove anything not A-Z
x <- gsub("[^a-z]", "", x)

# If the name starts with
## cough make it cou2f
x <- gsub("^cough", "cou2f", x)
## rough make it rou2f
x <- gsub("^rough", "rou2f", x)
## tough make it tou2f
x <- gsub("^tough", "tou2f", x)
## enough make it enou2f
x <- gsub("^enough", "enou2f", x)
## gn make it 2n
x <- gsub("^gn", "2n", x)

# If the name ends with
## mb make it m2
x <- gsub("mb$", "m2", x)

# Replace
## cq with 2q
x <- gsub("cq", "2q", x)
## ci with si
x <- gsub("ci", "si", x)
## ce with se
x <- gsub("ce", "se", x)
## cy with sy
x <- gsub("cy", "sy", x)
## tch with 2ch
x <- gsub("tch", "2ch", x)
## c with k
x <- gsub("c", "k", x)
## q with k
x <- gsub("q", "k", x)
## x with k
x <- gsub("x", "k", x)
## v with f
x <- gsub("v", "f", x)
## dg with 2g
x <- gsub("dg", "2g", x)
## tio with sio
x <- gsub("tio", "sio", x)
## tia with sia
x <- gsub("tia", "sia", x)
## d with t
x <- gsub("d", "t", x)
## ph with fh
x <- gsub("ph", "fh", x)
## b with p
x <- gsub("b", "p", x)
## sh with s2
x <- gsub("sh", "s2", x)
## z with s
x <- gsub("z", "s", x)
## any initial vowel with an A
x <- gsub("^[aeiou]", "A", x)
## all other vowels with a 3
x <- gsub("[aeiou]", "3", x)
## 3gh3 with 3kh3
x <- gsub("3gh3", "3kh3", x)
## gh with 22
x <- gsub("gh", "22", x)
## g with k
x <- gsub("g", "k", x)
## groups of the letter s with a S
x <- gsub("s+", "S", x)
## groups of the letter t with a T
x <- gsub("t+", "T", x)
## groups of the letter p with a P
x <- gsub("p+", "P", x)
## groups of the letter k with a K
x <- gsub("k+", "K", x)
## groups of the letter f with a F
x <- gsub("f+", "F", x)
## groups of the letter m with a M
x <- gsub("m+", "M", x)
## groups of the letter n with a N
x <- gsub("n+", "N", x)
## w3 with W3
x <- gsub("w3", "W3", x)
## wy with Wy
x <- gsub("wy", "Wy", x)
## wh3 with Wh3
x <- gsub("wh3", "Wh3", x)
## why with Why
x <- gsub("why", "Why", x)
## w with 2
x <- gsub("w", "2", x)
## any initial h with an A
x <- gsub("^h", "A", x)
## all other occurrences of h with a 2
x <- gsub("h", "2", x)
## r3 with R3
x <- gsub("r3", "R3", x)
## ry with Ry
x <- gsub("ry", "Ry", x)
## r with 2
x <- gsub("r", "2", x)
## l3 with L3
x <- gsub("l3", "L3", x)
## ly with Ly
x <- gsub("ly", "Ly", x)
## l with 2
x <- gsub("l", "2", x)
## j with y
x <- gsub("j", "y", x)
## y3 with Y3
x <- gsub("y3", "Y3", x)
## y with 2
x <- gsub("y", "2", x)

# remove all
## 2s
x <- gsub("2", "", x)
## 3s
x <- gsub("3", "", x)
# put six 1s on the end
x <- paste(x,"111111", sep="")
# take the first six characters as the code
unlist(lapply(x, FUN= function(x){paste((strsplit(x, "")[[1]])[1:6], collapse="")}))
}
David Hood
  • 11
  • 1