I have a data frame `my_df` with 10,000 sequences of varying length (between 13 and 18 characters), each composed of the digits 0 to 3.

Example of my data (60 rows):

library(stringdist)
library(igraph)
library(reshape2)


my_df <- structure(list(alfa_ch = c("2000000232003211","2000000331021", "20000003310320011", "20000003323331021", 
                           "20000003331001","20000003331001", "20000003332021", "200000100331021",
                           "20000013011001","20000013301021", "2000001333331011", "20000023231031",
                           "200000233302001","20000023331011", "20000023331012", "20000023332021",
                           "200000233331021","20000030231011", "200000303323331021", "200000313301021",
                           "20000032031021","2000003220021", "2000003221011", "2000003231031",
                           "20000032311001","200000330330021", "2000003311211", "2000003331001",
                           "2000003331001","2000003331012", "20000033321012", "200000333231011",
                           "20000033323331021","20000033331021", "2000010320011", "20000103323331021",
                           "200001113011001","20000113011001", "20000120330021", "20000123033011",
                           "2000012331131","2000013011001", "2000013301021", "200001330231011",
                           "2000013323001","20000133231311", "20000133301001", "200001333331011",
                           "200001333331011","200001333331011", "200001333331011", "20000200331021",
                           "20000200331021","20000200331131", "20000203221011", "2000020333133011",
                           "20000212221111","20000213301021", "2000021331011", "200002223231011")),
          row.names = c(1L,3L, 5L, 6L, 7L, 8L, 9L, 10L, 12L, 13L, 14L, 16L, 17L, 18L, 19L,20L, 21L,
                        23L, 24L, 27L, 29L, 31L, 32L, 33L, 34L, 35L, 38L, 41L,42L, 43L, 46L, 47L, 48L,
                        49L, 58L, 59L, 60L, 62L, 63L, 64L, 66L,68L, 71L, 72L, 73L, 74L, 75L, 77L, 78L,
                        79L, 80L, 81L, 82L, 83L,84L, 85L, 89L, 90L, 91L, 95L), class = "data.frame")

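My goal is to cluster them by edit distance < 3. I first computed the pairwise Levenshtein distance matrix:

# pairwise Levenshtein distances between all sequences
dist_mtx <- as.matrix(stringdistmatrix(my_df$alfa_ch, my_df$alfa_ch, method = "lv"))
# drop pairs that are too far apart, and zero distances (self-pairs and identical sequences)
dist_mtx[dist_mtx > 3] <- NA
dist_mtx[dist_mtx == 0] <- NA
colnames(dist_mtx) <- my_df$alfa_ch
rownames(dist_mtx) <- my_df$alfa_ch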

I then created an edge list, where the value is the edit distance between the two sequences:

edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c("seq1", "seq2"), as.is = TRUE))
edge_list <- edge_list[!is.na(edge_list$value), ]
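For illustration, here is a minimal base-R sketch of the same edge-list step on toy data. The sequences in `seqs` are hypothetical (not taken from the data above), and `adist()` from base R's `utils` package stands in for `stringdistmatrix(..., method = "lv")`. It keeps each unordered pair only once by looking at the upper triangle, since melting the full symmetric matrix yields both (a, b) and (b, a).

```r
# Toy sequences (hypothetical, not taken from the data above)
seqs <- c("0000", "0011", "1111", "0001")

# adist() computes the pairwise Levenshtein distance matrix
d <- adist(seqs)
dimnames(d) <- list(seqs, seqs)

# Keep each unordered pair once (upper triangle), with distance between 1 and 3
keep <- upper.tri(d) & d > 0 & d <= 3
idx <- which(keep, arr.ind = TRUE)

edges <- data.frame(
  seq1  = rownames(d)[idx[, "row"]],
  seq2  = colnames(d)[idx[, "col"]],
  value = d[idx],
  stringsAsFactors = FALSE
)
print(edges)
```

The pair "0000"/"1111" is at distance 4 and is therefore dropped; the other five pairs remain.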

Then I created the igraph object:

# vertex names must be unique, so duplicated sequences are collapsed
igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = unique(my_df$alfa_ch))

I then tried numerous methods to cluster the sequences, including the Louvain method, but I am still getting clusters whose members have an edit distance > 3 from each other. I am aware that this might be because of the connected components. So my questions are:

  1. Is there a way to cluster the sequences so that all members of each cluster are within edit distance < 3 of each other?
  2. Is there a way to identify the cluster centers (hubs; I tried hubness.score()) and assign vertices to those centers, taking the edit distance into account?

This is my first post; I would appreciate any help.
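As a sketch of what the two questions are asking for, and not necessarily the best approach for 10,000 sequences, one option is complete-linkage hierarchical clustering, which bounds the largest pairwise distance inside each cluster, followed by picking a medoid (the member with the smallest total distance to the others) as the cluster center. The example below uses base R only, with `adist()` standing in for `stringdistmatrix()`. The threshold `max_dist <- 3` is an assumption taken from the filtering code above; use 2 for a strict "< 3".

```r
# A few sequences taken from the example data
seqs <- unique(c("2000000331021", "20000003310320011", "20000003331001",
                 "20000003332021", "2000003331001", "2000003331012",
                 "2000013011001", "20000013011001"))

# Pairwise Levenshtein distances (base R)
d <- adist(seqs)
dimnames(d) <- list(seqs, seqs)

max_dist <- 3  # assumed threshold; adjust as needed

# Complete linkage: cutting at height h gives clusters whose
# largest within-cluster distance is at most h
hc <- hclust(as.dist(d), method = "complete")
cluster <- cutree(hc, h = max_dist)

members <- split(seqs, cluster)

# Largest pairwise distance inside each cluster
diameter <- sapply(members, function(m) max(d[m, m]))

# Medoid: the member with the smallest summed distance to its cluster mates
medoid <- sapply(members, function(m) {
  m[which.min(rowSums(d[m, m, drop = FALSE]))]
})

print(data.frame(cluster = names(members), medoid = medoid, diameter = diameter))
```

Because every sequence ends up in the same cluster as its medoid, it is at most `max_dist` edits away from that center.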

melato
  • Welcome to SO. It will be easier to try to help if your question includes a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. You can share (part of) your data using `dput` – desval Jan 03 '21 at 17:18
  • Suppose dist(A,B) is 2 and dist(B,C) is 2 and dist(A,C) is 4. What do you want to do? – G5W Jan 14 '21 at 00:57
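The scenario in the last comment can be reproduced with three toy strings (hypothetical, chosen so that the distances are exactly 2, 2 and 4). The sketch below uses base R only. Single linkage cut at a height of 3 gives the same groups as the connected components of a graph with edges for distances of at most 3, so it chains A, B and C together even though A and C are 4 edits apart. Complete linkage keeps them apart.

```r
# Toy strings: dist(A,B) = 2, dist(B,C) = 2, dist(A,C) = 4
seqs <- c(A = "0000", B = "0011", C = "1111")

d <- adist(seqs)
dimnames(d) <- list(names(seqs), names(seqs))
print(d)

# Single linkage at h = 3 behaves like connected components of the
# thresholded graph: A-B and B-C are linked, so all three are grouped
single <- cutree(hclust(as.dist(d), method = "single"), h = 3)

# Complete linkage at h = 3 never groups A and C (distance 4)
complete <- cutree(hclust(as.dist(d), method = "complete"), h = 3)

print(single)
print(complete)
```

With complete linkage B has to end up with either A or C. Which one it joins is decided by tie-breaking, which is the ambiguity the comment points at.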

0 Answers