Count every possible pair of values in a column grouped by multiple columns

Question

I have a dataframe that looks like this (this is just a subset, actually dataset has 2724098 rows)

> head(dat)

chr   start  end    enhancer motif 
chr10 238000 238600 9_EnhA1  GATA6 
chr10 238000 238600 9_EnhA1  GATA4 
chr10 238000 238600 9_EnhA1    SRF 
chr10 238000 238600 9_EnhA1  MEF2A 
chr10 375200 375400 9_EnhA1  GATA6 
chr10 375200 375400 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1  GATA6 
chr10 440400 441000 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1    SRF 
chr10 440400 441000 9_EnhA1  MEF2A 
chr10 441600 442000 9_EnhA1    SRF 
chr10 441600 442000 9_EnhA1  MEF2A

I was able to transform my dataset to this format where groups of chr, start, end and enhancer represent a single ID:

> dat

 id motif 
 1  GATA6 
 1  GATA4 
 1    SRF 
 1  MEF2A 
 2  GATA6 
 2  GATA4
 3  GATA6 
 3  GATA4 
 3    SRF 
 3  MEF2A 
 4    SRF 
 4  MEF2A

I want to find the count of every possible pair of motifs, grouped by id. So I want an output table like,

motif1 motif2 count
 GATA6  GATA4     3
 GATA6    SRF     2
 GATA6  MEF2A     2
 ... and so on for each pair of motif

In the actual dataset, there are 1716 unique motifs. There are 83509 unique id.

Any suggestions on how to proceed?

I don't know if it would be practical, but a possible approach would be to (1) convert everything to integers, condensing your grouping variables into a single `id` as you had before, and then (2) create a 1716x1716x83509 3-d sparse array (using the `slam` package) where the entries are binary for presence/absence, (3) your result is the `colsums` along the `id` dimension. — Gregor Thomas, Oct 07 '14 at 21:20
(2) is obviously the hard part. Alternatively, maybe SQL could handle it, it would be a fairly trivial self-join and then aggregate query. — Gregor Thomas, Oct 07 '14 at 21:22
@Gregor no background in SQL although thanks for the suggestion. — Komal Rathi, Oct 07 '14 at 21:25
Could this be considered a duplicate of [this question](http://stackoverflow.com/a/24627329/1086688)? — nograpes, Oct 07 '14 at 23:29
@nograpes That's a completely different question. I wouldn't have asked without doing my research. — Komal Rathi, Oct 08 '14 at 14:01
@KomalRathi It isn't *exactly* the same... but that question had the same data structure, just the output format was slightly different (a matrix instead of triplet representation). I did find enough of a relationship to use it to develop an answer. — nograpes, Oct 08 '14 at 17:52

Arun · Accepted Answer · 2014-10-08T15:41:15.510

10

Updated: Here is a fast and memory efficient version using data.table:

Step 1: Construct sample data of your dimensions approximately:

require(data.table) ## 1.9.4+
set.seed(1L)        ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)

Step 2: Pre-processing:

DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]

Step 3: Solution:

ans = DT[DT, {
              tmp = runlen < i.runlen; 
              list(motif[tmp], i.motif[any(tmp)])
             }, 
      by=.EACHI][, .N, by="V1,V2"]

This takes ~27 seconds and ~1GB of memory during the final step 3.

The idea is to perform a self-join, but make use of data.table's by=.EACHI feature, which evaluates the j-expression for each i, and therefore memory efficient. And the j-expression makes sure that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b,motif_a". This saves computation time and memory as well. And the binary search is quite fast, even though there are 87K+ ids. Finally we aggregate by the motif combinations to get the number of rows in each of them - which is what you require.

HTH

PS: See revision for the older (+ slower) version.

edited Oct 08 '14 at 15:41

answered Oct 07 '14 at 22:54

Arun

116,683
26
284
387

1

I wonder why `cc` is a 12X1 vector for me and not for you. Also..I am getting the following error running this: ```Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i), : x.'motif' is a factor column being joined to i.'V1' which is type 'NULL'. Factor columns must join to factor or character columns.```. This is running `data.table` 1.9.5` – Mike.Gahan Oct 08 '14 at 02:10
1

@Mike.Gahan, I used `stringsAsFactors=FALSE` while loading in. Seems like `as.data.table(.)` returns a 1 column matrix when the matrix columns are factors.. Is that the issue you're running into? Would be happy to fix if you could take a look at why/what happens for you and file an issue, please? – Arun Oct 08 '14 at 05:36
Works perfect now that I added `stringsAsFactors=FALSE`. Is there a trick to when to use `as.data.table` vs `setDT` or just `data.table()`? Also...great work with `foverlaps`. Awesome new function. – Mike.Gahan Oct 08 '14 at 12:01
@Mike.Gahan, thanks :-). Have updated the answer with a *much* faster version. Scales very well. `setDT` works only on list and data.frames, not matrix (due to it's internal structure). So, we can't avoid `as.data.table()` here, as `combn` returns matrix. Elsewhere, I'd prefer `setDT`... – Arun Oct 08 '14 at 12:20
@Arun That's a very smart answer. Thanks! I need to work on my data table skills. – Komal Rathi Oct 08 '14 at 14:33
@KomalRathi you should checkout nograpes' answer. – Arun Oct 08 '14 at 17:00

score 5 · Answer 2 · edited May 23 '17 at 11:58

Here is a sparse matrix technique shamelessly borrowed from this question.

# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))

# Create the sparse matrix.
library(Matrix)
s <- sparseMatrix(
      as.numeric(dat$id), 
      as.numeric(dat$motif),
      dimnames = list(levels(dat$id),levels(dat$motif)),
  x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)

#    motif1 motif2 number
# 1  GATA4  GATA6      3
# 2  GATA4  MEF2A      2
# 3  GATA6  MEF2A      2
# 4  GATA4    SRF      2
# 5  GATA6    SRF      2
# 6  MEF2A    SRF      3

@Arun I suspect that there is a way to make this even faster by only calculating the "upper (or lower) triangle" of the matrix. — nograpes, Oct 08 '14 at 17:49

Mike.Gahan · Answer 3 · 2014-10-08T14:18:44.207

3

I think the data.table package is probably the most efficient here. We can count pairs within each group, and then aggregate. It is a much more efficient way with data your size compared to counting all pairs in total first.

#Bring in data.table and convert data to data.table
require(data.table)
setDT(dat)

#Summarize by two-way pairs
summ <- dat[ , list(motifs=list(combn(unique(as.character(motif)),
   min(2,length(unique(as.character(motif)))), by=list(chr,start,end,enhancer)]

#Transpose and gather data into one table
motifs.table <- rbindlist(lapply(summ$motifs,function(x) data.table(t(x))))

#Summarize table with counts
motifs.table[ , .N, by=list(V1,V2)]

#       V1    V2 N
# 1: GATA6 GATA4 3
# 2: GATA6   SRF 2
# 3: GATA6 MEF2A 2
# 4: GATA4   SRF 2
# 5: GATA4 MEF2A 2
# 6:   SRF MEF2A 3

edited Oct 08 '14 at 14:18

answered Oct 07 '14 at 22:04

Mike.Gahan

4,565
23
39

Thanks. When I try the second command summ <- ... I get the following error: Error in combn(unique(as.character(motif)), 2) : n < m – Komal Rathi Oct 08 '14 at 14:06
There must be some groups with only 1 motif. I guess this answer doesn't fully account for this. I will modify, but @Aruns answer is much better. – Mike.Gahan Oct 08 '14 at 14:16
Ok. Thanks for the help though. +1 for trying to help. – Komal Rathi Oct 08 '14 at 14:17

score 2 · Answer 4 · answered Oct 07 '14 at 21:34

2

If you can get your data into a SQL table called dat, this query should work:

select d1.motif m1, d2.motif m2, count(*) count
from dat d1
join dat d2
on d1.chr = d2.chr
  and d1.start = d2.start
  and d1.end = d2.end
  and d1.enhancer = d2.enhancer
  and d1.motif <> d2.motif
group by d1.motif, d2.motif

Given the size of your data, I doubt the R sqldf package can handle it, but with a free MySQL installation you could use RODBC or RJDBC to have R and SQL talk.

answered Oct 07 '14 at 21:34

Gregor Thomas

136,190
20
167
294

You can almost certainly mimic this behavior with R functions. – shadowtalker Oct 07 '14 at 21:58
Yes, you can. I'm just not comfortable enough with `data.table` to do it, and I'm doubtful about RAM using anything else. I think this is the logic implemented in @Arun's answer. – Gregor Thomas Oct 07 '14 at 23:00

Michael Lawrence · Answer 5 · 2014-10-08T00:05:34.833

You might benefit from formally modelling the semantics of your data. If you have ranges on the genome, use the GenomicRanges package from Bioconductor.

library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)

This is a GRanges object, which formally understands the notion of genomic location, so these operations just work:

hits <- findMatches(gr, gr)
tab <- table(motif1=gr$motif[queryHits(hits)],
             motif2=gr$motif[subjectHits(hits)])
subset(as.data.frame(tab, responseName="count"), motif1 != motif2)

shadowtalker · Answer 6 · 2014-10-08T01:25:52.633

1

...if this isn't what you want, I'm giving up. Obviously it isn't optimized for a large data set. This is just a general algorithm that takes natural advantage of R. There are several improvements possible, e.g. with dplyr and data.table. The latter will greatly speed up the [ and %in% operations here.

motif_pairs <- combn(unique(dat$motif), 2)
colnames(motif_pairs) <- apply(motif_pairs, 2, paste, collapse = " ")
motif_pair_counts <- apply(motif_pairs, 2, function(motif_pair) {
  sum(daply(dat[dat$motif %in% motif_pair, ], .(id), function(dat_subset){
    all(motif_pair %in% dat_subset$motif)
  }))
})
motif_pair_counts <- as.data.frame(unname(cbind(t(motif_pairs), motif_pair_counts)))
names(motif_pair_counts) <- c("motif1", "motif2", "count")
motif_pair_counts

#   motif1 motif2 count
# 1  GATA6  GATA4     3
# 2  GATA6    SRF     2
# 3  GATA6  MEF2A     2
# 4  GATA4    SRF     2
# 5  GATA4  MEF2A     2
# 6    SRF  MEF2A     3

Another old version. PLEASE make sure your question is clear!

This is precisely what plyr was designed to accomplish. Try dlply(dat, .(id), function(x) table(x$motif) ).

But please don't just try to copy and paste this solution without at least reading the documentation. plyr is a very powerful package and it will be very helpful for you to understand it.

Old post answering the wrong question:

Are you looking for disjoint or overlapping pairs?

Here's one solution using the function rollapply from package zoo:

library(zoo)

motif_pairs <- rollapply(dat$motif, 2, c)              # get a matrix of pairs
motif_pairs <- apply(motif_pairs, 1, function(row) {   # for every row...
  paste0(sort(row), collapse = " ")                    #   sort the row, and concatenate it to a single string
                                                       #   (sorting ensures that pairs are not double-counted)
})
table(motif_pairs)                                     # since each pair is now represented by a unique string, just tabulate the string appearances

## if you want disjoint pairs, do `rollapply(dat$motif, 2, c, by = 2)` instead

Take a look at the docs for rollapply if this isn't quite what you need. For grouping by other variables, you can combine this with one of:

base R functions aggregate or by (not recommended), or
the *ply functions from plyr (better)

edited Oct 08 '14 at 01:25

answered Oct 07 '14 at 21:31

shadowtalker

12,529
3
53
96

I have grouped my variables into one single id. – Komal Rathi Oct 07 '14 at 21:41
What do you mean by disjoint and overlapping pairs? I just want to count the number of times a pair occurs across every group. How do I convert the numbers in the output to motif names? – Komal Rathi Oct 07 '14 at 21:45
@KomalRathi if this isn't right, you need to be more clear about what you mean by a "pair". – shadowtalker Oct 07 '14 at 21:50
I did not mean to offend you and no way I was implying that you are incorrect. I meant to ask you what are disjoint and overlapping pairs and then I just tried to explain what I need. By pair, I mean each possible pair of motifs. I have also provided an example of what the desired output should look like based on the data that I have shown. In the data that is shown, GATA6 and GATA4 are two motifs. They co-occur in 3 different ids. So the count is 3. Similarly, GATA6 and SRF co-occur in 2 different ids so the count is 2. – Komal Rathi Oct 07 '14 at 21:51
No offense taken. I'm assuming you mean *adjacent* pairs, but that's not clear from your post. – shadowtalker Oct 07 '14 at 21:56
Given a sequence `(1 2 3 4 5)`, overlapping adjacent pairs would be `(1 2) (2 3) (3 4) (4 5)` whereas disjoint adjacent pairs would be `(1 2) (3 4) (5)`. Which of those are you looking for? – shadowtalker Oct 07 '14 at 21:57
No I mean, every possible pair of motif. Not just adjacent pairs. I have shown in the question also. Also if you are unclear, read the big comment that I have written. – Komal Rathi Oct 07 '14 at 21:58
Oh, I didn't notice the ID variable. Then this is incredibly easy. I'll edit my post. – shadowtalker Oct 07 '14 at 22:00
Grouping by `id` doesn't work either, not unless you do a rollup of all the tables at the end. In OP's example output, the `GATA6` `GATA4` row has count 3 because that is the count of unique `id`s in which `GATA4` and `GATA6` appear together. – Gregor Thomas Oct 07 '14 at 22:56
@Gregor I was under the impression OP wanted that table for each `id`. – shadowtalker Oct 08 '14 at 01:06

score 1 · Answer 7 · answered Oct 07 '14 at 22:25

1

What about this?:

res1<- split(dat$motif,dat$id)
res2<- lapply(res1,function(x) combn(x,2))
res3<- apply(do.call(cbind,res2),2,function(x) paste(x[1],x[2],sep="_"))

table(res3)

answered Oct 07 '14 at 22:25

DatamineR

10,428
3
25
45

Count every possible pair of values in a column grouped by multiple columns

7 Answers7

Linked