
I have the following data:

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red", "black-white", "black-white-purple")
df <- data.frame(attributes)
df

           attributes 
1  apple-water-orange
2         apple-water
3        apple-orange
4              coffee
5    coffee-croissant
6    green-red-yellow
7      green-red-blue
8           green-red
9         black-white
10 black-white-purple

What I want is another column that assigns a category to each row, based on observation similarity.

category <- c(1,1,1,2,2,3,3,3,4,4)
df <- cbind(df, category)

           attributes category
1  apple-water-orange        1
2         apple-water        1
3        apple-orange        1
4              coffee        2
5    coffee-croissant        2
6    green-red-yellow        3
7      green-red-blue        3
8           green-red        3
9         black-white        4
10 black-white-purple        4

It is clustering in the broader sense, but I think most clustering methods are designed for numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).

Does anyone have an idea how to approach this task? Maybe some word-matching approach?

It would also be great if I could adjust the degree of similarity (coarse vs. fine-grained "clustering") via a parameter.

Thanks in advance for any idea!

constiii
  • http://stackoverflow.com/questions/21511801/text-clustering-with-levenshtein-distances – LyzandeR Jan 26 '17 at 13:13
  • [This post](http://stackoverflow.com/questions/8468716/clustering-strings-in-r-is-it-possible#8469845) might help. – Paulo MiraMor Jan 26 '17 at 13:14
  • Clustering works on *distances*, regardless of data type. And you can compute distances between strings using a number of different methods. – Konrad Rudolph Jan 26 '17 at 13:22
  • It appears that the first word is always the same? So perhaps `match(sub('-.*', '', attributes), unique(sub('-.*', '', attributes)))` would be enough? – Sotos Jan 26 '17 at 13:23
  • @Paulo MiraMor: The problem is that I have sequences of words - each sequence, and also the words it contains, can have a different length. Hence using Levenshtein distance is challenging, since it compares the positions of letters in two strings. "abcd-efgh-ijkl" and "efgh-ijkl", for example, are very similar, but their Levenshtein distance would be large. – constiii Jan 26 '17 at 13:29
  • @Sotos: Sorry, my reproducible example is not the best. The first word is not always the same. – constiii Jan 26 '17 at 13:30
  • @KonradRudolph Do you know a proper method, by chance? Note that I don't want to cluster single words, but sequences of words (which can appear in varying order). – constiii Jan 26 '17 at 13:32
  • It depends too much on what the actual data looks like. (Local) edit distance may be a good choice (the counter-example you've shown above would have a good score in local edit distance), and the algorithm can be extended to allow inversions. If you just want to cluster based on word matches, it's much simpler and more efficient to split your strings into words and quantify the overlap of words (`length(intersect(words_vector_1, words_vector_2))`; see the sketch after these comments). – Konrad Rudolph Jan 26 '17 at 13:40
  • @KonradRudolph Yeah, that was what I thought of first, but how do I "cluster" based on the word overlap? – constiii Jan 26 '17 at 17:49
  • @Anony-Mousse The question has not been answered in the other thread, since my question considers **sequences** of strings, where the **order** of the strings can vary! So Levenshtein distance doesn't work well ;) – constiii Jan 29 '17 at 13:14
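A minimal sketch of the word-overlap idea from Konrad Rudolph's comment above (the `jaccard` helper and variable names are illustrations, not from the thread): split each string into its words, then score every pair by the Jaccard similarity of the two word sets. Clustering can then run on `1 - sim` as a distance matrix.

words <- strsplit(attributes, "-")                  # split each item into its words
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
sim <- outer(seq_along(words), seq_along(words),
             Vectorize(function(i, j) jaccard(words[[i]], words[[j]])))
round(sim[1:3, 1:3], 2)  # e.g. "apple-water-orange" vs. "apple-water" -> 0.67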

1 Answer


So I have whipped up two possibilities. Option 1 uses one-hot encoding, which is simple and straightforward as long as you are happy with, for example, apple/apples being treated as just as different as apple/orange. I use the Jaccard index as the distance metric (one minus the size of the intersection of two sets divided by the size of their union) because it does reasonably well with overlapping sets. Option 2 uses a local sequence-alignment algorithm and should be quite robust against things like apple/apples vs. apple/orange; it also has more tuning parameters, which could take time to optimize for your problem.

library(reshape2)  # dcast() for long-to-wide reshaping
library(proxy)     # extra distance measures, including Jaccard
library(gplots)    # heatmap.2()

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", 
                "coffee-croissant", "green-red-yellow", "green-red-blue", 
                "green-red", "black-white", "black-white-purple")
dat <- data.frame(attr = attributes, row.names = paste("id", seq_along(attributes), sep = ""))
attributesList <- strsplit(attributes, "-")

# long format: one row per (id, word) pair
df <- data.frame(id = paste("id", rep(seq_along(attributesList), sapply(attributesList, length)), sep = ""), 
                 word = unlist(attributesList))

# one-hot encode: one row per word, one column per id, 0/1 entries
df.wide <- dcast(data = df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1] 
df.wide <- as.matrix(df.wide[, -1])

# Jaccard distances between ids (proxy masks stats::dist)
df.dist <- dist(t(df.wide), method = "jaccard")
plot(hclust(df.dist))
abline(h = c(0.6, 0.8))   # two candidate cut heights
heatmap.2(df.wide, trace = "none", col = rev(heat.colors(15)))

# assign categories at two levels of granularity by cutting the tree
res <- merge(dat, data.frame(cat1 = cutree(hclust(df.dist), h = 0.8)), by = "row.names")
res <- merge(res, data.frame(cat2 = cutree(hclust(df.dist), h = 0.6)), by.y = "row.names", by.x = "Row.names")
res

You'll see you can control the granularity of the categorization by adjusting where you cut the dendrogram.

[Figure: cluster dendrogram with cut lines at h = 0.6 and 0.8]

[Figure: heatmap of the one-hot encoded attribute matrix]
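As a small addition (not in the original answer), `cutree` also accepts `k` if you would rather fix the number of groups than pick a cut height:

cutree(hclust(df.dist), k = 4)  # ask for exactly four groups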

Here is a method using the Smith-Waterman (local) alignment.

Biostrings is part of the Bioconductor project. The SW algorithm finds the optimal local (i.e., not end-to-end) alignment of two sequences (strings). In this case you can again use cutree to set your categories, and you can also tune the scoring function to suit your needs.

library(Biostrings)  # Bioconductor package providing BString and pairwiseAlignment

strList <- lapply(attributes, BString)

# all-vs-all local (Smith-Waterman) alignment scores; higher = more similar
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type = "local")@score
}), nrow = 10)

heatmap.2(swDist, trace = "none", col = rev(heat.colors(15)),
          labRow = paste("id", 1:10, sep = ""), labCol = paste("id", 1:10, sep = ""))

[Figure: heatmap of pairwise Smith-Waterman alignment scores]
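To use cutree here as the answer suggests, note that `swDist` holds similarity scores, not distances. One rough conversion (an assumption on my part, not something from the answer) is to normalize each pair by the self-alignment scores and subtract from 1:

self <- diag(swDist)                                  # self-alignment scores (matrix diagonal)
simNorm <- pmax(swDist / sqrt(outer(self, self)), 0)  # scale to roughly [0, 1]; clamp stray negatives
swD <- as.dist(1 - simNorm)                           # high similarity -> small distance
plot(hclust(swD))
cutree(hclust(swD), h = 0.5)                          # tune the cut height for coarser/finer categories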

emilliman5