
I am trying to loop through 53 rows in a data.frame and create an adjacency matrix from the results. However, I keep getting stuck because the loop does not run correctly.

I have tried matching as well as applying various count() functions, without success.

MRE: (In truth, the data is a lot larger so my unique search is actually 217k elements)

df1<-data.frame(col1=c(12345,123456,1234567,12345678),
col2=c(54321,54432,12345,76543),
col3=c(11234,12234,1234567,123345),
col4=c(54321,54432,12345,76543))

search<-c(12345,1234567,75643,54432)

I would like to loop through each row and update a new matrix/df where the count per number in [search] would be the output.

Ex:

df2

        12345     1234567    75643    54432
row1    TRUE       TRUE      FALSE    FALSE
row2    FALSE      FALSE     TRUE      TRUE
row3    TRUE       TRUE      FALSE    FALSE
row4    TRUE       FALSE     TRUE     TRUE
OctoCatKnows
  • Just a doubt, is the expected output `df2` ? – akrun Jun 26 '19 at 14:43
  • Indeed it is, although it is probably going to be massive. Once I figure this out, I am going to assign a weight to the variables (the ids in the 'search' vector), e.g. '12345' and '1234567' occur n times together within the df. Basically a 'From'/'To' edge list, with 'From' being the unique id (search) and 'To' being the shared id. – OctoCatKnows Jun 26 '19 at 14:50

2 Answers


I think you should look at the tf (term frequency) approach from text mining. Here is one way to apply it to your example, using the quanteda package to build a matrix of counts. You can then run whatever searches you like against those counts:

library("quanteda")

df1<-data.frame(col1=c(12345,123456,1234567,12345678),
                col2=c(54321,54432,12345,76543),
                col3=c(11234,12234,1234567,123345),
                col4=c(54321,54432,12345,76543))
df2<-apply(df1,2,paste, collapse = " ") # Collapse each column into one space-separated string
# Tokenize first: newer quanteda releases expect a tokens object in dfm()
DocTerm <- quanteda::dfm(quanteda::tokens(df2))
DocTerm

Document-feature matrix of: 4 documents, 10 features (60.0% sparse).
4 x 10 sparse Matrix of class "dfm"
      features
docs   12345 123456 1234567 12345678 54321 54432 76543 11234 12234 123345
  col1     1      1       1        1     0     0     0     0     0      0
  col2     1      0       0        0     1     1     1     0     0      0
  col3     0      0       1        0     0     0     0     1     1      1
  col4     1      0       0        0     1     1     1     0     0      0
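If the dfm object is awkward to inspect, the same per-column counts can be sketched in base R with `table()`. This is only an illustration under the assumption that the ids are plain numbers that print without scientific notation; the names `long`, `counts`, and `counts_search` are made up for this example and are not part of quanteda.

```r
# Same toy data as in the question
df1 <- data.frame(col1 = c(12345, 123456, 1234567, 12345678),
                  col2 = c(54321, 54432, 12345, 76543),
                  col3 = c(11234, 12234, 1234567, 123345),
                  col4 = c(54321, 54432, 12345, 76543))
search <- c(12345, 1234567, 76543, 54432)  # typo-adjusted search vector

# Long format: one (document, id) pair per cell.
# unlist() walks the data frame column by column, matching rep(..., each = ).
long <- data.frame(doc = rep(names(df1), each = nrow(df1)),
                   id  = unlist(df1, use.names = FALSE))

# Cross-tabulate to get a document-by-id count table
counts <- table(long$doc, long$id)

# Keep only the searched ids that actually occur, as a plain integer matrix
keep <- intersect(as.character(search), colnames(counts))
counts_search <- unclass(counts[, keep, drop = FALSE])
counts_search
```

Comparing `counts_search > 0` gives a TRUE/FALSE matrix in the shape asked for in the question, with the columns of `df1` as documents.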

I hope this helps!

Carles
  • Hi, I ran this but I am having trouble examining the result (to get the output you show). I have tried DocTerm@i and others, but I am just seeing a bunch of lists. – OctoCatKnows Jun 26 '19 at 15:45
  • For me it works perfectly. Check your R version; that might be causing the trouble. To extract stuff, I hope this other question might help you out: https://stackoverflow.com/questions/53540787/finding-cosine-similarity-of-documents-and-their-removal-from-r-dataframe/53542848#53542848 – Carles Jun 26 '19 at 17:55
  • Hmmm. This case works fine; however, when applied at a larger scale (unique ids = 75000) the matrix is not available. I'm going to try with a significantly larger sample to see if it repeats. – OctoCatKnows Jun 26 '19 at 20:25

It is unclear how your counts are derived: there may be a typo in the search vector (75643 != 76543), and it is not clear whether you are checking by row or by column. With that caveat, consider a nested sapply and apply solution, shown here for both margins:

By Row

search <- c(12345, 1234567, 76543, 54432)                                # ADJUSTED TYPO    
mat <- sapply(search, function(s) apply(df1, 1, function(x) s %in% x))   # 1 FOR ROW MARGIN

colnames(mat) <- search
rownames(mat) <- paste0("row", seq(nrow(df1)))

mat
#      12345 1234567 76543 54432
# row1  TRUE   FALSE FALSE FALSE
# row2 FALSE   FALSE FALSE  TRUE
# row3  TRUE    TRUE FALSE FALSE
# row4 FALSE   FALSE  TRUE FALSE
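Because the nested `sapply`/`apply` calls `%in%` once per search value per row, it may get slow with hundreds of thousands of search values. A possible alternative, sketched here under the assumption that `search` contains no duplicates, is a single `match()` over all cells. The names `m`, `idx`, `hit`, `mat2`, and `edges` are illustrative only.

```r
df1 <- data.frame(col1 = c(12345, 123456, 1234567, 12345678),
                  col2 = c(54321, 54432, 12345, 76543),
                  col3 = c(11234, 12234, 1234567, 123345),
                  col4 = c(54321, 54432, 12345, 76543))
search <- c(12345, 1234567, 76543, 54432)  # typo-adjusted, assumed unique

m   <- as.matrix(df1)
idx <- match(m, search)   # position in search for every cell, NA if absent
hit <- !is.na(idx)

# Dense logical matrix: rows of df1 by search values
mat2 <- matrix(FALSE, nrow = nrow(m), ncol = length(search),
               dimnames = list(paste0("row", seq_len(nrow(m))),
                               as.character(search)))
mat2[cbind(row(m)[hit], idx[hit])] <- TRUE
mat2

# Sparse alternative: one line per (row, id) pair that actually occurs
edges <- unique(data.frame(row = row(m)[hit], id = search[idx[hit]]))
edges
```

For very large inputs the `edges` data frame is likely the more practical form, since it stores only the hits rather than a full rows-by-ids matrix.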

By Column

search <- c(12345, 1234567, 76543, 54432)                                # ADJUSTED TYPO
mat <- sapply(search, function(s) apply(df1, 2, function(x) s %in% x))   # 2 FOR COL MARGIN

colnames(mat) <- search
rownames(mat) <- paste0("col", seq(ncol(df1)))

mat
#      12345 1234567 76543 54432
# col1  TRUE    TRUE FALSE FALSE
# col2  TRUE   FALSE  TRUE  TRUE
# col3 FALSE    TRUE FALSE FALSE
# col4  TRUE   FALSE  TRUE  TRUE
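Once a by-column membership matrix like this exists, counting how often two ids appear in the same column is a matrix product. The following is a rough sketch, not a tuned solution: it rebuilds `mat` from the toy data, and the names `co`, `pair_idx`, and `edge_list` are invented for the example.

```r
df1 <- data.frame(col1 = c(12345, 123456, 1234567, 12345678),
                  col2 = c(54321, 54432, 12345, 76543),
                  col3 = c(11234, 12234, 1234567, 123345),
                  col4 = c(54321, 54432, 12345, 76543))
search <- c(12345, 1234567, 76543, 54432)  # typo-adjusted

# Same by-column membership matrix as above
mat <- sapply(search, function(s) apply(df1, 2, function(x) s %in% x))
colnames(mat) <- search

# t(mat) %*% mat: entry [i, j] = number of columns containing both id i and id j
co <- crossprod(mat + 0)   # "+ 0" turns TRUE/FALSE into 1/0

# Turn the upper triangle into a weighted From/To edge list, dropping zero weights
pair_idx  <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
edge_list <- data.frame(from   = colnames(co)[pair_idx[, "row"]],
                        to     = colnames(co)[pair_idx[, "col"]],
                        weight = co[pair_idx])
edge_list
```

The diagonal of `co` holds how many columns each id appears in; the off-diagonal entries are the pairwise weights.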

Rextester demo

Parfait
  • Parfait, this is exactly what I was looking for. However - any idea if we can run apply with dplyr to get a weight count? The matrix is massive (10 obs = 25 MB) and I have 800... so that's going to add up, haha. I am hoping I can loop in a count function to get the number of times numbers appear together in a column, something like [$Freq... 12345:1234567 (5) | 12345:76543 (2) etc.]. That may help in controlling the RAM overkill, and it is what the end state is anyway. – OctoCatKnows Jun 26 '19 at 16:22
  • Great. Glad to help. The weighting question should be asked in a different post, as you will need to clearly describe and show what you mean, with an earnest attempt at a solution. – Parfait Jun 26 '19 at 16:24