Question


Let's say I have this dataframe:

# mock data set
df.size <- 10
cluster.id <- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)

This gives something like:

     cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

Now I want to group these per cluster.id and see which letters occur within each cluster; for example, cluster 3 contains the letters A, E, D and C. Then I want to get all unique pairwise combinations (but no combinations of a letter with itself, so no A,A): A,E; A,D; A,C etc. Finally, I want to update the count for each of these combinations in an adjacency matrix/data frame.

Idea


# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations

What I tried


# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible


# what I tried
update.adj <- function( data ) {
  for (comb in combn(data$letters,2)) {
    # stuck here
  }
}

test.set %>% group_by(cluster.id) %>% update.adj(.)

There is probably an easy way to do this, because I see adjacency matrices all the time, but I'm not able to figure it out. Please let me know if anything is unclear.


Answer to comment
Answer to @Manuel Bickel: For the data I gave as example (the table under "will be something like"): This matrix will be A-->Z for the full dataset, keep that in mind.

  A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0

I will explain what I did:

    cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

Only the clusters containing more than 1 unique letter are relevant (because we don't want combinations of a letter with itself; e.g. cluster 4 contains only the letter B, so it would result in the combination B,B and is therefore not relevant):

 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E

Now I look for each cluster what pairwise combinations I can make:

cluster 3:

A,E
A,D
A,C
E,D
E,C
D,C

Update these combination in the adjacency matrix:

  A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0

Then go to the next cluster

cluster 2

A,E

Update the adjacency matrix again:

  A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
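
The manual procedure walked through above can be sketched directly in base R. This is only an illustration of the steps as described, not a recommended solution for large data; `update_adj` is a hypothetical helper name, and the adjacency matrix is restricted to A..E to keep the output readable.

```r
# Example data from the table above
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

# Empty adjacency matrix (A..E only, for readability)
possible <- LETTERS[1:5]
adj <- matrix(0, nrow = length(possible), ncol = length(possible),
              dimnames = list(possible, possible))

# Hypothetical helper: add 1 for every unique pair of letters in one cluster
update_adj <- function(adj, lets) {
  lets <- unique(lets)
  if (length(lets) < 2) return(adj)  # clusters with one unique letter add nothing
  pairs <- combn(lets, 2)            # 2-row matrix, one pair per column
  for (j in seq_len(ncol(pairs))) {
    a <- pairs[1, j]
    b <- pairs[2, j]
    adj[a, b] <- adj[a, b] + 1
    adj[b, a] <- adj[b, a] + 1       # keep the matrix symmetric
  }
  adj
}

# Visit each cluster in turn and update the matrix
for (lets in split(test.set$letters, test.set$cluster.id)) {
  adj <- update_adj(adj, lets)
}
adj
```

Looping over clusters like this is easy to follow but slow in R; it is mainly useful as a check against a vectorised approach on a small sample.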

In reaction to the huge dataset

library(reshape2)

test.set <- read.table(text = "
                            cluster.id   letters
                       1          5       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = T, stringsAsFactors = F)

x1 <- reshape2::dcast(test.set, cluster.id ~ letters)

x1
#cluster.id A B C D E
#1          1 1 0 0 0 0
#2          2 1 0 0 0 1
#3          3 1 0 1 1 1
#4          4 0 2 0 0 0
#5          5 1 0 0 0 0

x2 <- table(test.set)

x2
#          letters
#cluster.id A B C D E
#         1 1 0 0 0 0
#         2 1 0 0 0 1
#         3 1 0 1 1 1
#         4 0 2 0 0 0
#         5 1 0 0 0 0


x1.c <- crossprod(x1)
#Error in crossprod(x, y) : 
#  requires numeric/complex matrix/vector arguments

x2.c <- crossprod(x2)
#works fine
CodeNoob
  • I do not fully understand what your expected output should look like. Could you provide an example, thank you. – Manuel Bickel Nov 22 '17 at 10:11
  • It's the adj.df filled with counts indicating how often a combination was found in each cluster, does this make sense? @ManuelBickel – CodeNoob Nov 22 '17 at 10:15
  • I get the part about the combinations within an individual cluster, but I do not fully understand what the output of `update.adj` shall be. Could you provide a short example output (can be very short, e.g., 2x2 or so) – Manuel Bickel Nov 22 '17 at 10:19
  • @ManuelBickel I updated my question, hopefully it's clear now, please let me know if not – CodeNoob Nov 22 '17 at 10:37
  • Thanks for the update, I think it's more or less clear now. I'll have a look at it later or tomorrow depending on my schedule... – Manuel Bickel Nov 22 '17 at 11:48
  • After having checked other questions and answers I think the solution proposed by [Tyler Rinker](https://stackoverflow.com/users/1000343/tyler-rinker) in [this answer](https://stackoverflow.com/questions/21419507/adjacency-matrix-in-r) is what you want. Simply apply it on your `test.set`. An additional side note, your example is quite well now, and it is very good that you have provided the code to generate data, just next time use `set.seed()` for the random number generator so others can exactly reproduce your data. Please, tell me if the solution works for you... – Manuel Bickel Nov 22 '17 at 20:35
  • Possible duplicate of [Adjacency matrix in R](https://stackoverflow.com/questions/21419507/adjacency-matrix-in-r) – Manuel Bickel Nov 23 '17 at 08:06

1 Answer

Following the comment above, here is Tyler Rinker's code applied to your data. I hope this is what you want.

UPDATE: Following below comments, I added a solution using the package reshape2 in order to be able to handle larger amounts of data.

test.set <- read.table(text = "
                            cluster.id   letters
                       1          5       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = T, stringsAsFactors = F)

x <- table(test.set)
x
#          letters
#cluster.id A B C D E
#         1 1 0 0 0 0
#         2 1 0 0 0 1
#         3 1 0 1 1 1
#         4 0 2 0 0 0
#         5 1 0 0 0 0

#base approach, based on answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
x

#        letters
# letters A B C D E
#       A 0 0 1 1 2
#       B 0 0 0 0 0
#       C 1 0 0 1 1
#       D 1 0 1 0 1
#       E 2 0 1 1 0

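#reshape2 approach
library(reshape2)
#count rows per cluster/letter explicitly instead of relying on the defaults
x <- acast(test.set, cluster.id ~ letters, fun.aggregate = length, value.var = "letters")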
x <- crossprod(x)
diag(x) <- 0
x
#   A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
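
One caveat, stated as an assumption about the intent rather than as part of the approach above: `crossprod` multiplies counts, so a letter that occurs more than once in a cluster that also holds other letters makes that pair count more than once. The example data does not trigger this, because the duplicated B is alone in cluster 4. If each cluster should contribute at most 1 per pair, as in the unique-combinations walkthrough in the question, a minimal sketch is to reduce the table to presence/absence first. The `dup.set` data below is made up purely for illustration.

```r
# Hypothetical data: letter A occurs twice in cluster 1, together with E
dup.set <- data.frame(
  cluster.id = c(1, 1, 1, 2),
  letters    = c("A", "A", "E", "A"),
  stringsAsFactors = FALSE
)

x <- table(dup.set)

# Raw counts: the A-E pair in cluster 1 is counted 2 * 1 = 2 times
raw <- crossprod(x)
diag(raw) <- 0

# Presence/absence (0/1): each cluster contributes at most 1 per pair
present <- (x > 0) * 1
bin <- crossprod(present)
diag(bin) <- 0

raw["A", "E"]
bin["A", "E"]
```

Which of the two is wanted depends on whether repeated letters within a cluster should carry extra weight.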
Manuel Bickel
  • Thankyou, how does this check whether the letters are in the same cluster? – CodeNoob Nov 23 '17 at 09:02
  • I have added the output of the `table()` call in my answer. This gives you the counts of each letter per cluster. The crossproduct finally checks all the counts of all possible combinations so to speak, which are the adjacency counts you are looking for (another way to write the crossproduct of a matrix m would be `t(m) %*% m`). Does that help? – Manuel Bickel Nov 23 '17 at 09:29
  • Look so easy hahah, only problem with this is that I get the error: "Error in table(tab) : attempt to make a table with >= 2^31 elements" because my original dataset is huge ;( @Manuel Bickel – CodeNoob Nov 23 '17 at 10:56
  • I have ~2 million rows in the format that I gave as example (probably should have mentioned that though) – CodeNoob Nov 23 '17 at 11:01
  • I see, you have not mentioned that you have such large number of clusters. Does it work if you use the `reshape2` package like so: `reshape2::dcast(test.set, cluster.id ~ letters)`. Have never tested it with such large data, but maybe the package can handle your amount... If not we might have to think about another solution, or you might have to pose another question which is about how to handle the size... Probably the packages `Matrix` or `data.table()` can solve this issue. – Manuel Bickel Nov 23 '17 at 11:04
  • I updated my question with the code, it seems to work fine with reshape2 however this will give an error when applying the crossprod command. (I accidentally edited your answer instead of my question, sorry! ) – CodeNoob Nov 23 '17 at 11:46
  • Updated the answer. If it works please consider accepting the answer. – Manuel Bickel Nov 23 '17 at 12:49
  • Thankyou works fine now, p.s you could change dcast to acast, so it returns a matrix automatically, then "as.matrix(x[,-1])" can be just replaced by x – CodeNoob Nov 23 '17 at 15:39
  • Good idea, have not thought through all options offered by `reshape2` when responding. Currently travelling, therefore, I can only change that tomorrow, but I will. I am glad the solution works for you. – Manuel Bickel Nov 23 '17 at 16:25
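
The comments above mention the `Matrix` package as a possible way around the size limit of `table()`. A minimal sketch of that idea follows, assuming `Matrix` is installed (it ships with standard R installations as a recommended package). It has not been tested on the ~2 million row dataset mentioned above, so treat it as a starting point only.

```r
library(Matrix)

test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

# Map cluster ids and letters to integer row/column indices
cl  <- factor(test.set$cluster.id)
let <- factor(test.set$letters)

# Sparse cluster-by-letter count matrix; duplicated (i, j) entries are summed
m <- sparseMatrix(i = as.integer(cl), j = as.integer(let),
                  x = rep(1, nrow(test.set)),
                  dims = c(nlevels(cl), nlevels(let)),
                  dimnames = list(levels(cl), levels(let)))

# The result is only letters x letters, so converting it to dense is cheap
adj <- as.matrix(crossprod(m))
diag(adj) <- 0
adj
```

Only the non-zero cluster/letter cells are stored, so memory use grows with the number of rows in the data rather than with clusters times letters.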