
I have a very long list of country names that I need to group by the country they refer to. Many entries are misspelled, and many others are written in other languages. For example:

THAILAND TUNESIE TUNIS TUNISIE TURCQUIE TURKIJE TURQUIE Tailand italie italien italy

How can I pool them into groups easily? Classifying them by hand is a huge pain. I have thought about some way of comparing strings or characters, but I haven't figured out an efficient way to do it. I can work with R and C/C++.

I'd really appreciate some help. Thank you very much!

adrian1121

1 Answer


Here's one approach:

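# Raw spellings from the question
x <- c("THAILAND", "TUNESIE", "TUNIS", "TUNISIE", "TURCQUIE", "TURKIJE",
       "TURQUIE", "Tailand", "italie", "italien", "italy")
# Pairwise edit (generalized Levenshtein) distances, ignoring case
m <- adist(x, x, ignore.case = TRUE)
dimnames(m) <- list(x, x)
# Average-linkage hierarchical clustering on the distance matrix
hc <- hclust(as.dist(m), method = "average")
# Draw the dendrogram and box the clusters obtained by cutting at height 3.8
plot(hc); rect.hclust(hc, h = 3.8)
# Split the names by cluster membership
split(x, cutree(hc, h = 3.8))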
# $`1`
# [1] "THAILAND" "Tailand" 
# 
# $`2`
# [1] "TUNESIE" "TUNIS"   "TUNISIE"
# 
# $`3`
# [1] "TURCQUIE" "TURKIJE"  "TURQUIE" 
# 
# $`4`
# [1] "italie"  "italien" "italy" 
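
Once the groups look right, a possible next step is to turn them into a lookup table, so that only one name per group has to be typed by hand instead of one per spelling. The following is a minimal sketch under that assumption: the `labels` vector and the `lookup` data frame are hypothetical names introduced here for illustration, and `labels` must be ordered to match the group numbers printed above. The cut height of 3.8 suits this small sample and would probably need tuning on the full list.

```r
# Same data and clustering as above
x <- c("THAILAND", "TUNESIE", "TUNIS", "TUNISIE", "TURCQUIE", "TURKIJE",
       "TURQUIE", "Tailand", "italie", "italien", "italy")
m <- adist(x, x, ignore.case = TRUE)
hc <- hclust(as.dist(m), method = "average")

# Integer group id for every spelling, in the same order as x
grp <- cutree(hc, h = 3.8)

# Hypothetical hand-assigned names, one per group (order follows the group ids)
labels <- c("Thailand", "Tunisia", "Turkey", "Italy")

# Lookup table mapping each raw spelling to its group and chosen name
lookup <- data.frame(raw = x, group = grp, country = labels[grp],
                     stringsAsFactors = FALSE)
print(lookup)
```

The table can then be joined back onto the original data, for example with `merge()` on the raw spelling column.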

Here is another one.

lukeA