
I have the following data:

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red", "black-white", "black-white-purple")
df <- data.frame(attributes)
df

           attributes 
1  apple-water-orange
2         apple-water
3        apple-orange
4              coffee
5    coffee-croissant
6    green-red-yellow
7      green-red-blue
8           green-red
9         black-white
10 black-white-purple

What I want is another column that assigns a category to each row, based on observation similarity.

category <- c(1,1,1,2,2,3,3,3,4,4)
df <- cbind(df, category)

           attributes category
1  apple-water-orange        1
2         apple-water        1
3        apple-orange        1
4              coffee        2
5    coffee-croissant        2
6    green-red-yellow        3
7      green-red-blue        3
8           green-red        3
9         black-white        4
10 black-white-purple        4

It is clustering in the broader sense, but I think most clustering methods are designed for numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).

Does anyone have an idea how to approach this task? Maybe some word-matching approach?

It would also be great if I could adjust the degree of similarity (coarse vs. fine-grained "clustering") via a parameter.

Thanks in advance for any idea!

constiii
  • http://stackoverflow.com/questions/21511801/text-clustering-with-levenshtein-distances – LyzandeR Jan 26 '17 at 13:13
  • [This post](http://stackoverflow.com/questions/8468716/clustering-strings-in-r-is-it-possible#8469845) might help. – Paulo MiraMor Jan 26 '17 at 13:14
  • Clustering works on *distances*, regardless of data type. And you can compute distances between strings using a number of different methods. – Konrad Rudolph Jan 26 '17 at 13:22
  • It appears that the first word is always the same? So perhaps `match(sub('-.*', '', attributes), unique(sub('-.*', '', attributes)))` would be enough? – Sotos Jan 26 '17 at 13:23
  • @Paulo MiraMor: The problem is that I have sequences of words - each sequence, and also the words it contains, can have a different length. Hence using Levenshtein distance is challenging, since it compares the positions of letters in two strings. "abcd-efgh-ijkl" and "efgh-ijkl", for example, are very similar, but their Levenshtein distance would be large. – constiii Jan 26 '17 at 13:29
  • @Sotos: Sorry, my reproducible example is not the best. The first word is not always the same. – constiii Jan 26 '17 at 13:30
  • @KonradRudolph Do you know a proper method, by chance? Note that I don't want to cluster single words, but sequences of words (which can appear in varying order). – constiii Jan 26 '17 at 13:32
  • It depends too much on what the actual data looks like. (Local) edit distance may be a good choice (the counter-example you've shown above would have a good score in local edit distance), and the algorithm can be extended to allow inversions. If you just want to cluster based on word matches, it's much simpler and more efficient to split your strings into words and quantify the overlap of words (`length(intersect(words_vector_1, words_vector_2))`; see the sketch after these comments). – Konrad Rudolph Jan 26 '17 at 13:40
  • @KonradRudolph Yeah, that was what I thought of first, but how do I "cluster" based on the word overlap? – constiii Jan 26 '17 at 17:49
  • @Anony-Mousse The question has not been answered in the other thread, since my question considers **sequences** of strings, where the **order** of the strings can vary! So Levenshtein distance doesn't work well ;) – constiii Jan 29 '17 at 13:14
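A minimal sketch of the word-overlap idea from Konrad Rudolph's comment above (the `jaccard` helper and variable names are illustrations, not from the thread): split each string into its words, then score every pair by the Jaccard similarity of the two word sets. Clustering can then run on `1 - sim` as a distance matrix.

words <- strsplit(attributes, "-")                  # split each item into its words
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
sim <- outer(seq_along(words), seq_along(words),
             Vectorize(function(i, j) jaccard(words[[i]], words[[j]])))
round(sim[1:3, 1:3], 2)  # e.g. "apple-water-orange" vs. "apple-water" -> 0.67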

1 Answer


So I have whipped up two possibilities. Option 1 uses one-hot encoding, which is simple and straightforward as long as you are happy with, for example, apple/apples being treated as just as different as apple/orange. I use the Jaccard index as the distance metric (one minus the size of the intersection of two sets divided by the size of their union) because it does reasonably well with overlapping sets. Option 2 uses a local sequence-alignment algorithm and should be quite robust against things like apple/apples vs. apple/orange; it also has more tuning parameters, which could take time to optimize for your problem.

library(reshape2)  # dcast() for long-to-wide reshaping
library(proxy)     # extra distance measures, including Jaccard
library(gplots)    # heatmap.2()

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", 
                "coffee-croissant", "green-red-yellow", "green-red-blue", 
                "green-red", "black-white", "black-white-purple")
dat <- data.frame(attr = attributes, row.names = paste("id", seq_along(attributes), sep = ""))
attributesList <- strsplit(attributes, "-")

# long format: one row per (id, word) pair
df <- data.frame(id = paste("id", rep(seq_along(attributesList), sapply(attributesList, length)), sep = ""), 
                 word = unlist(attributesList))

# one-hot encode: one row per word, one column per id, 0/1 entries
df.wide <- dcast(data = df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1] 
df.wide <- as.matrix(df.wide[, -1])

# Jaccard distances between ids (proxy masks stats::dist)
df.dist <- dist(t(df.wide), method = "jaccard")
plot(hclust(df.dist))
abline(h = c(0.6, 0.8))   # two candidate cut heights
heatmap.2(df.wide, trace = "none", col = rev(heat.colors(15)))

# assign categories at two levels of granularity by cutting the tree
res <- merge(dat, data.frame(cat1 = cutree(hclust(df.dist), h = 0.8)), by = "row.names")
res <- merge(res, data.frame(cat2 = cutree(hclust(df.dist), h = 0.6)), by.y = "row.names", by.x = "Row.names")
res

You'll see you can control the granularity of the categorization by adjusting where you cut the dendrogram.

[Figure: cluster dendrogram with cut lines at h = 0.6 and 0.8]

[Figure: heatmap of the one-hot encoded attribute matrix]
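As a small addition (not in the original answer), `cutree` also accepts `k` if you would rather fix the number of groups than pick a cut height:

cutree(hclust(df.dist), k = 4)  # ask for exactly four groups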

Here is a method using the Smith-Waterman (local) alignment.

Biostrings is part of the Bioconductor project. The SW algorithm finds the optimal local (i.e., not end-to-end) alignment of two sequences (strings). In this case you can again use cutree to set your categories, and you can also tune the scoring function to suit your needs.

library(Biostrings)  # Bioconductor package providing BString and pairwiseAlignment

strList <- lapply(attributes, BString)

# all-vs-all local (Smith-Waterman) alignment scores; higher = more similar
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type = "local")@score
}), nrow = 10)

heatmap.2(swDist, trace = "none", col = rev(heat.colors(15)),
          labRow = paste("id", 1:10, sep = ""), labCol = paste("id", 1:10, sep = ""))

[Figure: heatmap of pairwise Smith-Waterman alignment scores]
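To use cutree here as the answer suggests, note that `swDist` holds similarity scores, not distances. One rough conversion (an assumption on my part, not something from the answer) is to normalize each pair by the self-alignment scores and subtract from 1:

self <- diag(swDist)                                  # self-alignment scores (matrix diagonal)
simNorm <- pmax(swDist / sqrt(outer(self, self)), 0)  # scale to roughly [0, 1]; clamp stray negatives
swD <- as.dist(1 - simNorm)                           # high similarity -> small distance
plot(hclust(swD))
cutree(hclust(swD), h = 0.5)                          # tune the cut height for coarser/finer categories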

emilliman5