I have a data frame my_df with 10,000 sequences of varying length (13 to 18 characters), each made up of the digits 0-3.
A sample of my data (60 rows):
library(stringdist)
library(igraph)
library(reshape2)
my_df <- structure(list(alfa_ch = c("2000000232003211","2000000331021", "20000003310320011", "20000003323331021",
"20000003331001","20000003331001", "20000003332021", "200000100331021",
"20000013011001","20000013301021", "2000001333331011", "20000023231031",
"200000233302001","20000023331011", "20000023331012", "20000023332021",
"200000233331021","20000030231011", "200000303323331021", "200000313301021",
"20000032031021","2000003220021", "2000003221011", "2000003231031",
"20000032311001","200000330330021", "2000003311211", "2000003331001",
"2000003331001","2000003331012", "20000033321012", "200000333231011",
"20000033323331021","20000033331021", "2000010320011", "20000103323331021",
"200001113011001","20000113011001", "20000120330021", "20000123033011",
"2000012331131","2000013011001", "2000013301021", "200001330231011",
"2000013323001","20000133231311", "20000133301001", "200001333331011",
"200001333331011","200001333331011", "200001333331011", "20000200331021",
"20000200331021","20000200331131", "20000203221011", "2000020333133011",
"20000212221111","20000213301021", "2000021331011", "200002223231011")),
row.names = c(1L,3L, 5L, 6L, 7L, 8L, 9L, 10L, 12L, 13L, 14L, 16L, 17L, 18L, 19L,20L, 21L,
23L, 24L, 27L, 29L, 31L, 32L, 33L, 34L, 35L, 38L, 41L,42L, 43L, 46L, 47L, 48L,
49L, 58L, 59L, 60L, 62L, 63L, 64L, 66L,68L, 71L, 72L, 73L, 74L, 75L, 77L, 78L,
79L, 80L, 81L, 82L, 83L,84L, 85L, 89L, 90L, 91L, 95L), class = "data.frame")
My goal is to cluster them by edit distance < 3.
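To make "edit distance" concrete, here is a small sketch using two sequences from the sample above. It assumes the Levenshtein metric (`method = "lv"`), which is the one used in the rest of the post; the variable names `a`, `b` and `d_ab` are only for illustration.

```r
library(stringdist)

# Two sequences taken from the sample data; they differ only in the last two digits.
a <- "2000003331001"
b <- "2000003331012"

# Levenshtein distance: minimum number of insertions, deletions and substitutions.
d_ab <- stringdist(a, b, method = "lv")
d_ab
```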
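# Pairwise Levenshtein distances between all sequences
dist_mtx <- as.matrix(stringdistmatrix(my_df$alfa_ch, my_df$alfa_ch, method = "lv"))
dist_mtx[dist_mtx > 3] <- NA   # drop pairs that are too far apart
dist_mtx[dist_mtx == 0] <- NA  # drop self-pairs and exact duplicates
# Label rows and columns with the sequences themselves
colnames(dist_mtx) <- my_df$alfa_ch
rownames(dist_mtx) <- my_df$alfa_ch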
Then I created an edge list, where the value is the edit distance between the two sequences:
# Long format: one row per (seq1, seq2) pair that survived the threshold
edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c("seq1", "seq2"), as.is = TRUE))
edge_list <- edge_list[!is.na(edge_list$value), ]
Then I created the igraph object:
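# Vertex names must be unique, so duplicates in the data are collapsed here
igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = data.frame(name = unique(my_df$alfa_ch)))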
Then I tried several approaches to cluster these sequences, including the Louvain method, but I still get clusters whose members have edit distance > 3. I'm aware this may be because of the connected components. So my questions are:
- Is there a way to cluster the sequences so that, within each cluster, all members have edit distance < 3 to one another?
- Is there a way to identify the cluster centers (hubs; I tried hubness.score()) and assign vertices to those centers while taking the edit distance into account?
This is my first post; I would appreciate any help.
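To illustrate the connected-components issue, here is a minimal sketch on a made-up chain of four sequences (not from my data). Each neighbouring pair is 2 edits apart, so every link passes the threshold, yet the two ends of the chain are 6 edits apart and still land in the same component. The toy sequences, the `<= 3` cutoff (taken from the filtering code above) and the helper `max_within` are assumptions for illustration only.

```r
library(stringdist)
library(igraph)

# Hypothetical toy chain: neighbours are 2 edits apart, the two ends are 6 apart.
seqs <- c("20000000", "20000011", "20001111", "20111111")
d <- stringdistmatrix(seqs, seqs, method = "lv")
dimnames(d) <- list(seqs, seqs)

# Connect every pair within the threshold (self-pairs excluded).
adj <- (d <= 3 & d > 0) * 1
g <- graph_from_adjacency_matrix(adj, mode = "undirected")

# Largest pairwise edit distance inside each group of a membership vector.
max_within <- function(membership, d) {
  sapply(split(seq_along(membership), membership),
         function(idx) max(d[idx, idx]))
}

# All four sequences end up in one component despite the threshold on edges.
comp <- components(g)$membership
max_within(comp, d)
```

The same helper can be applied to any other membership vector, for example `as.integer(membership(cluster_louvain(g)))`, to check whether a clustering respects the distance limit.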