I have an array and would like a measure of the similarity between the values in each pair of columns. That is, I want to compare two columns row by row and increment a counter whenever their values match. The measure would then be at its maximum for two identical columns.

Essentially my problem is the same as the one discussed here: R: Compare all the columns pairwise in matrix, except that I do not want empty cells to be counted.

With the example data created from code derived from the linked page:

data1 <- c("", "B", "", "", "")
data2 <- c("A", "", "", "", "")
data3 <- c("", "", "C", "", "A")
data4 <- c("", "", "", "", "")
data5 <- c("", "", "C", "", "A")
data6 <- c("", "B", "C", "", "")

my.matrix <- cbind(data1, data2, data3, data4, data5, data6)

similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))
for(col in 1:ncol(my.matrix)){
  # compare this column against every column, row by row
  matches <- my.matrix[,col] == my.matrix
  match.counts <- colSums(matches)
  match.counts[col] <- 0  # ignore the self-comparison
  similarity.matrix[,col] <- match.counts
}

I obtain:

similarity.matrix =

    V1  V2  V3  V4  V5  V6
1   0   3   2   4   2   4
2   3   0   2   4   2   2
3   2   2   0   3   5   3
4   4   4   3   0   3   3
5   2   2   5   3   0   3
6   4   2   3   3   3   0

which also counts rows where both cells are empty as matches.
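
The reason appears to be that R treats two empty strings as equal, so every row where both columns are blank adds to the count. A minimal check on two of the columns above (the names `a`, `b`, `blank.matches` and `value.matches` are only for illustration):

```r
# Same values as data1 and data4 in the example
a <- c("", "B", "", "", "")
b <- c("", "", "", "", "")

# "" == "" is TRUE, so rows where both cells are blank count as matches
blank.matches <- sum(a == b)

# Additionally requiring a non-empty value excludes those rows
value.matches <- sum(a == b & a != "")
```

Here `blank.matches` corresponds to the unwanted count in the matrix above, while `value.matches` is the kind of count I am after.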

My desired output would be:

expected.output =

    V1  V2  V3  V4  V5  V6
1   0   0   0   0   0   1
2   0   0   0   0   0   0
3   0   0   0   0   2   1
4   0   0   0   0   0   0
5   0   0   2   0   0   1
6   1   0   1   0   1   0

Thanks,

Matt

mattbawn
  • Can you show the expected output? Try replacing the `''` with `NA`, i.e. `is.na(matrix) <- matrix==''`, and in your loop use `match.counts <- colSums(matches, na.rm=TRUE)`. – akrun Jun 10 '15 at 19:54
  • Yes. I was just checking that with my real data. Could you please add this as the answer then? – mattbawn Jun 10 '15 at 20:09
  • I can't see it any more. I think it's been removed. – mattbawn Jun 10 '15 at 20:12
  • I see your loop, but a verbal explanation of what is meant by "similarity" would improve the question a great deal. I don't really think a link to another question is a substitute for explaining it here. By the way, you might not want to call anything `matrix`, since that's the name of a commonly used function. – Frank Jun 10 '15 at 21:03

1 Answer

The following is the answer from akrun.

First, change the blank cells to NA:

is.na(my.matrix) <- my.matrix==''

and then drop the NAs when computing match.counts:

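similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))

for(col in 1:ncol(my.matrix)){
  # comparisons involving NA yield NA rather than TRUE
  matches <- my.matrix[,col] == my.matrix
  # na.rm=TRUE drops those NAs, so blank cells are not counted
  match.counts <- colSums(matches, na.rm=TRUE)
  match.counts[col] <- 0  # ignore the self-comparison
  similarity.matrix[,col] <- match.counts
}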

This did indeed give me my desired output:

    V1  V2  V3  V4  V5  V6
1   0   0   0   0   0   1
2   0   0   0   0   0   0
3   0   0   0   0   2   1
4   0   0   0   0   0   0
5   0   0   2   0   0   1
6   1   0   1   0   1   0
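
As a possible variation, the same counts can probably be obtained without converting the blanks to NA, by requiring that a matching cell is also non-empty. This is only a sketch on the example data, not part of akrun's suggestion; the helper `count.matches` and the variable `n` are names made up for illustration:

```r
data1 <- c("", "B", "", "", "")
data2 <- c("A", "", "", "", "")
data3 <- c("", "", "C", "", "A")
data4 <- c("", "", "", "", "")
data5 <- c("", "", "C", "", "A")
data6 <- c("", "B", "C", "", "")

my.matrix <- cbind(data1, data2, data3, data4, data5, data6)
n <- ncol(my.matrix)

# For column j, a row counts as a match only when the two values
# are equal AND the cell is not an empty string
count.matches <- function(j) {
  colSums(my.matrix[, j] == my.matrix & my.matrix != "")
}

# One column of counts per column of my.matrix
similarity.matrix <- sapply(seq_len(n), count.matches)
diag(similarity.matrix) <- 0  # ignore self-comparisons
```

This leaves `my.matrix` itself unchanged, which may be convenient if the blanks are needed later; the NA approach above is shorter if they are not.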

Thank you.

mattbawn