calculate co-occurrences

Question

screenshot

I have a file as shown in the screenshot attached. There are 61 events(peaks) and I want to find how often each peak occurs with the other (co-occurrence) for all possible combinations. The file has the frequency (number of times the peak appears in the 47 samples) and probability (no.of times the peak occurs divided by total number of samples).

Then I want to find mutually exclusive peaks using the formula p(x,y) / p(x)*p(y), where p(x,y) is the probability that x and y co-occur, p(x) is probability of peak (x) and p(y) is the probability of peak y.

What is the best way to solve such a problem? Do I need to write a Perl script or are there some R functions I could use? I am a biologist trying to learn Perl and R so I would appreciate some example code to solve this problem.

There is certainly a convenient way to do this kind of operation in R. Instead of providing a screenshot, take a very small sample of your data and provide it. Then provide exactly what you want as output. — nograpes, May 17 '12 at 21:36
Here's a post that will show you how to do that: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . You should make the example fairly minimal, and may want to just adapt the example data I provided in my answer. — Josh O'Brien, May 17 '12 at 22:30

Josh O'Brien · Accepted Answer · 2012-05-17T22:55:45.177

1

In the following, I've assumed that what you alternately call p(xy) and p(x,y) should actually be the probability (rather than the number of times) that x and y co-occur. If that's not correct, just remove the division by nrow(X) from the 2nd line below.

# As an example, create a sub-matrix of your data
X <- cbind(c(0,0,0,0,0,0), c(1,0,0,1,1,1), c(1,1,0,0,0,0))

num <- (t(X) %*% X)/nrow(X)              # The numerator of your expression   
means <- colMeans(X)                     # A vector of means of each column
denom <- outer(colMeans(X), colMeans(X)) # The denominator
out <- num/denom
#      [,1] [,2] [,3]
# [1,]  NaN  NaN  NaN
# [2,]  NaN 1.50 0.75
# [3,]  NaN 0.75 3.00

Note: The NaNs in the results are R's way of indicating that those cells are "Not a number" (since they are each the result of dividing 0 by 0).

edited May 17 '12 at 22:55

answered May 17 '12 at 21:35

Josh O'Brien

159,210
26
366
455

The problem is that in my matrix (X) there are 61 peaks(rows) and 47 samples(columns) so when i use your method, i end up with 47 rows and 47 columns. How do i identify the probability for each pair of peaks? – Tjb LaMac May 18 '12 at 15:07
After reading in your data, you can just transpose it, doing `X <- t(X)`, then use the code in my example. It's pretty standard in data analysis generally (and R specifically) to have each column be a separate feature, and each row be an observation, so it'll be worth your while to keep your data in that format in R. – Josh O'Brien May 18 '12 at 15:34

user1317221_G · Answer 2 · 2012-05-19T13:46:53.290

your question is not completely clear without a proper example but I am thinking this result is along the lines of what you want i.e. "I want to find how often each peak occurs with the other (co-occurrence) "

library(igraph)
library(tnet)
library(bipartite)

#if you load your data in as a matrix e.g.

mat<-matrix(c(1,1,0,2,2,2,3,3,3,4,4,0),nrow=4,byrow=TRUE) # e.g.

 #     [,1] [,2] [,3]   #  your top line as columns  e.g.81_05  131_00 and peaks as rows
#[1,]    1    1    0
#[2,]    2    2    2
#[3,]    3    3    3
#[4,]    4    4    0

then

pairs<-web2edges(mat,return=TRUE)
pairs<- as.tnet(pairs,type="weighted two-mode tnet")
peaktopeak<-projecting_tm(pairs, method="sum")
peaktopeak

#peaktopeak
#   i j w
#1  1 2 2 # top row here says peak1 and peak2 occurred together twice
#2  1 3 2
#3  1 4 2
#4  2 1 4
#5  2 3 6
#6  2 4 4
#7  3 1 6
#8  3 2 9
#9  3 4 6
#10 4 1 8
#11 4 2 8
#12 4 3 8  # peak4 occured with peak3 8 times

EDIT: If mutually exclusive peaks that do not occur are just those that do not share 1s in the same columns as your original data then you can just see this in peaktopeak. For instance if peak 1 and 3 never occur they wont be found in peaktopeak in the same row.

To look at this easier you could:

peakmat <- tnet_igraph(peaktopeak,type="weighted one-mode tnet")
peakmat<-get.adjacency(peakmat,attr="weight")

e.g.:

#     [,1] [,2] [,3] [,4]
#[1,]    0    2    2    2
#[2,]    4    0    6    4
#[3,]    6    9    0    6
#[4,]    8    8    8    0 # zeros would represent peaks that never co occur. 

#In this case everything shares at least 2 co-occurrences 
#diagonals are 0 as saying  peak1 occurs with itself is obviously silly.

Excellent! Will you please suggest how then to find mutually exclusive peaks probably using the formula p(x,y)/p(x)*p(y)? — Tjb LaMac, May 18 '12 at 16:07
for your p(x,y)/p(x)*p(y) i think @Josh did a good job of showing you how that would work. — user1317221_G, May 18 '12 at 16:37

calculate co-occurrences

2 Answers2