
This code:

library(dplyr)
library(stringdist)

set.seed(42)
rm(list = ls())
options(scipen = 999)

data <- data.frame(string = c("world hello", "hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

distance_function <- function(string_1, string_2) {
    stringdist(string_1, string_2, method = "qgram")
}

combinations <- combn(nrow(data), 2)
distances <- matrix(, nrow = 1, ncol = ncol(combinations))

distance_matrix <- matrix(0, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))

for (i in 1:ncol(combinations)) {

    distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])

    distance_matrix[combinations[1, i], combinations[2, i]] <- distance
    distance_matrix[combinations[2, i], combinations[1, i]] <- distance
}

dendo <- hclust(dist(1 - distance_matrix), method = "ward.D2")

grp <- cutree(dendo, k = 3)

grp[dendo$order]

Results in:

hello world hello world             hello vorld           hello world 1             hello world             world hello             hello world 
                      3                       2                       1                       1                       1                       1

How can I transform this into a data frame like the following, with the original strings in one column and the cluster membership in another, ordered by 'similarity'?

hello world hello world 3
hello vorld 2
hello world 1 1
hello world 1
world hello 1
hello world 1

Btw, why does:

class(grp[dendo$order])

result in:

[1] "integer"

Surely, it is not an integer?

cs0815
  • It is just a named integer vector. Nothing special there really. You can do `stack(as.data.frame(as.list(grp[dendo$order])))` as one option. – MrFlick May 03 '19 at 16:45
  • If you want a tibble (which is basically a data.frame) you can just do `tibble::enframe(grp[dendo$order])`. – MrFlick May 03 '19 at 16:48
  • Thanks. I am sorry, but IMHO this is not a duplicate. With your suggested approach the original strings are also converted to factors and end up in 'dot notation', which is wrong (e.g. hello world -> hello.world). I simply want the original strings as one column and the cluster membership as another column. – cs0815 May 04 '19 at 08:25
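
One possible way to get what the last comment asks for, sketched under a few assumptions: `ordered_grp` is a hypothetical stand-in for `grp[dendo$order]`, typed in by hand from the output shown in the question so the snippet runs without `stringdist`, and `result`, `string` and `cluster` are illustrative names. The sketch builds the data frame from `names()` and `unname()` directly, so the strings stay cell values and never become column names. That is most likely where the dots came from: `as.data.frame(as.list(...))` turns the names into column names, and `data.frame` makes column names syntactically valid by default.

```r
# Hypothetical stand-in for grp[dendo$order], copied from the printed output
ordered_grp <- c(
  "hello world hello world" = 3L,
  "hello vorld"             = 2L,
  "hello world 1"           = 1L,
  "hello world"             = 1L,
  "world hello"             = 1L,
  "hello world"             = 1L
)

# The class is "integer" because the names are only an attribute of the vector
class(ordered_grp)
names(ordered_grp)

# One row per element: names become a character column, values a second column
result <- data.frame(
  string  = names(ordered_grp),
  cluster = unname(ordered_grp),
  stringsAsFactors = FALSE  # keep plain character strings on R < 4.0 as well
)

result
```

With the real objects, the same call should work after replacing the stand-in with `ordered_grp <- grp[dendo$order]`; the row order then follows `dendo$order`.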

0 Answers