5

I have a dataframe that looks like this (this is just a subset, actually dataset has 2724098 rows)

> head(dat)

chr   start  end    enhancer motif 
chr10 238000 238600 9_EnhA1  GATA6 
chr10 238000 238600 9_EnhA1  GATA4 
chr10 238000 238600 9_EnhA1    SRF 
chr10 238000 238600 9_EnhA1  MEF2A 
chr10 375200 375400 9_EnhA1  GATA6 
chr10 375200 375400 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1  GATA6 
chr10 440400 441000 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1    SRF 
chr10 440400 441000 9_EnhA1  MEF2A 
chr10 441600 442000 9_EnhA1    SRF 
chr10 441600 442000 9_EnhA1  MEF2A 

I was able to transform my dataset to this format where groups of chr, start, end and enhancer represent a single ID:

> dat

 id motif 
 1  GATA6 
 1  GATA4 
 1    SRF 
 1  MEF2A 
 2  GATA6 
 2  GATA4
 3  GATA6 
 3  GATA4 
 3    SRF 
 3  MEF2A 
 4    SRF 
 4  MEF2A 

I want to find the count of every possible pair of motifs, grouped by id. So I want an output table like,

motif1 motif2 count
 GATA6  GATA4     3
 GATA6    SRF     2
 GATA6  MEF2A     2
 ... and so on for each pair of motif

In the actual dataset, there are 1716 unique motifs. There are 83509 unique id.

Any suggestions on how to proceed?

gmsantos
  • 1,412
  • 22
  • 30
Komal Rathi
  • 4,164
  • 13
  • 60
  • 98
  • How many unique `id`s are in your data? – Gregor Thomas Oct 07 '14 at 20:42
  • @Gregor I will edit my question for clarity in a few mins. – Komal Rathi Oct 07 '14 at 21:00
  • I don't know if it would be practical, but a possible approach would be to (1) convert everything to integers, condensing your grouping variables into a single `id` as you had before, and then (2) create a 1716x1716x83509 3-d sparse array (using the `slam` package) where the entries are binary for presence/absence, (3) your result is the `colsums` along the `id` dimension. – Gregor Thomas Oct 07 '14 at 21:20
  • (2) is obviously the hard part. Alternatively, maybe SQL could handle it, it would be a fairly trivial self-join and then aggregate query. – Gregor Thomas Oct 07 '14 at 21:22
  • @Gregor no background in SQL although thanks for the suggestion. – Komal Rathi Oct 07 '14 at 21:25
  • Could this be considered a duplicate of [this question](http://stackoverflow.com/a/24627329/1086688)? – nograpes Oct 07 '14 at 23:29
  • @nograpes That's a completely different question. I wouldn't have asked without doing my research. – Komal Rathi Oct 08 '14 at 14:01
  • @KomalRathi It isn't *exactly* the same... but that question had the same data structure, just the output format was slightly different (a matrix instead of triplet representation). I did find enough of a relationship to use it to develop an answer. – nograpes Oct 08 '14 at 17:52

7 Answers7

10

Updated: Here is a fast and memory efficient version using data.table:

  • Step 1: Construct sample data of your dimensions approximately:

    require(data.table) ## 1.9.4+
    set.seed(1L)        ## For reproducibility
    N = 2724098L
    motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
    id = sample(83509, N, TRUE)
    DT = data.table(id, motif)
    
  • Step 2: Pre-processing:

    DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
    setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
    setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
    DT[, runlen := .I]
    
  • Step 3: Solution:

    ans = DT[DT, {
                  tmp = runlen < i.runlen; 
                  list(motif[tmp], i.motif[any(tmp)])
                 }, 
          by=.EACHI][, .N, by="V1,V2"]
    

    This takes ~27 seconds and ~1GB of memory during the final step 3.

The idea is to perform a self-join, but make use of data.table's by=.EACHI feature, which evaluates the j-expression for each i, and therefore memory efficient. And the j-expression makes sure that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b,motif_a". This saves computation time and memory as well. And the binary search is quite fast, even though there are 87K+ ids. Finally we aggregate by the motif combinations to get the number of rows in each of them - which is what you require.

HTH

PS: See revision for the older (+ slower) version.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • 1
    I wonder why `cc` is a 12X1 vector for me and not for you. Also..I am getting the following error running this: ```Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i), : x.'motif' is a factor column being joined to i.'V1' which is type 'NULL'. Factor columns must join to factor or character columns.```. This is running `data.table` 1.9.5` – Mike.Gahan Oct 08 '14 at 02:10
  • 1
    @Mike.Gahan, I used `stringsAsFactors=FALSE` while loading in. Seems like `as.data.table(.)` returns a 1 column matrix when the matrix columns are factors.. Is that the issue you're running into? Would be happy to fix if you could take a look at why/what happens for you and file an issue, please? – Arun Oct 08 '14 at 05:36
  • Works perfect now that I added `stringsAsFactors=FALSE`. Is there a trick to when to use `as.data.table` vs `setDT` or just `data.table()`? Also...great work with `foverlaps`. Awesome new function. – Mike.Gahan Oct 08 '14 at 12:01
  • @Mike.Gahan, thanks :-). Have updated the answer with a *much* faster version. Scales very well. `setDT` works only on list and data.frames, not matrix (due to it's internal structure). So, we can't avoid `as.data.table()` here, as `combn` returns matrix. Elsewhere, I'd prefer `setDT`... – Arun Oct 08 '14 at 12:20
  • @Arun That's a very smart answer. Thanks! I need to work on my data table skills. – Komal Rathi Oct 08 '14 at 14:33
  • @KomalRathi you should checkout nograpes' answer. – Arun Oct 08 '14 at 17:00
5

Here is a sparse matrix technique shamelessly borrowed from this question.

# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))

# Create the sparse matrix.
library(Matrix)
s <- sparseMatrix(
      as.numeric(dat$id), 
      as.numeric(dat$motif),
      dimnames = list(levels(dat$id),levels(dat$motif)),
  x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)

#    motif1 motif2 number
# 1  GATA4  GATA6      3
# 2  GATA4  MEF2A      2
# 3  GATA6  MEF2A      2
# 4  GATA4    SRF      2
# 5  GATA6    SRF      2
# 6  MEF2A    SRF      3
Community
  • 1
  • 1
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • 1
    @Arun I suspect that there is a way to make this even faster by only calculating the "upper (or lower) triangle" of the matrix. – nograpes Oct 08 '14 at 17:49
3

I think the data.table package is probably the most efficient here. We can count pairs within each group, and then aggregate. It is a much more efficient way with data your size compared to counting all pairs in total first.

#Bring in data.table and convert data to data.table
require(data.table)
setDT(dat)

#Summarize by two-way pairs
summ <- dat[ , list(motifs=list(combn(unique(as.character(motif)),
   min(2,length(unique(as.character(motif)))), by=list(chr,start,end,enhancer)]

#Transpose and gather data into one table
motifs.table <- rbindlist(lapply(summ$motifs,function(x) data.table(t(x))))

#Summarize table with counts
motifs.table[ , .N, by=list(V1,V2)]

#       V1    V2 N
# 1: GATA6 GATA4 3
# 2: GATA6   SRF 2
# 3: GATA6 MEF2A 2
# 4: GATA4   SRF 2
# 5: GATA4 MEF2A 2
# 6:   SRF MEF2A 3
Mike.Gahan
  • 4,565
  • 23
  • 39
2

If you can get your data into a SQL table called dat, this query should work:

select d1.motif m1, d2.motif m2, count(*) count
from dat d1
join dat d2
on d1.chr = d2.chr
  and d1.start = d2.start
  and d1.end = d2.end
  and d1.enhancer = d2.enhancer
  and d1.motif <> d2.motif
group by d1.motif, d2.motif

Given the size of your data, I doubt the R sqldf package can handle it, but with a free MySQL installation you could use RODBC or RJDBC to have R and SQL talk.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • You can almost certainly mimic this behavior with R functions. – shadowtalker Oct 07 '14 at 21:58
  • Yes, you can. I'm just not comfortable enough with `data.table` to do it, and I'm doubtful about RAM using anything else. I think this is the logic implemented in @Arun's answer. – Gregor Thomas Oct 07 '14 at 23:00
2

You might benefit from formally modelling the semantics of your data. If you have ranges on the genome, use the GenomicRanges package from Bioconductor.

library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)

This is a GRanges object, which formally understands the notion of genomic location, so these operations just work:

hits <- findMatches(gr, gr)
tab <- table(motif1=gr$motif[queryHits(hits)],
             motif2=gr$motif[subjectHits(hits)])
subset(as.data.frame(tab, responseName="count"), motif1 != motif2)
Michael Lawrence
  • 1,031
  • 5
  • 6
1

...if this isn't what you want, I'm giving up. Obviously it isn't optimized for a large data set. This is just a general algorithm that takes natural advantage of R. There are several improvements possible, e.g. with dplyr and data.table. The latter will greatly speed up the [ and %in% operations here.

motif_pairs <- combn(unique(dat$motif), 2)
colnames(motif_pairs) <- apply(motif_pairs, 2, paste, collapse = " ")
motif_pair_counts <- apply(motif_pairs, 2, function(motif_pair) {
  sum(daply(dat[dat$motif %in% motif_pair, ], .(id), function(dat_subset){
    all(motif_pair %in% dat_subset$motif)
  }))
})
motif_pair_counts <- as.data.frame(unname(cbind(t(motif_pairs), motif_pair_counts)))
names(motif_pair_counts) <- c("motif1", "motif2", "count")
motif_pair_counts

#   motif1 motif2 count
# 1  GATA6  GATA4     3
# 2  GATA6    SRF     2
# 3  GATA6  MEF2A     2
# 4  GATA4    SRF     2
# 5  GATA4  MEF2A     2
# 6    SRF  MEF2A     3

Another old version. PLEASE make sure your question is clear!

This is precisely what plyr was designed to accomplish. Try dlply(dat, .(id), function(x) table(x$motif) ).

But please don't just try to copy and paste this solution without at least reading the documentation. plyr is a very powerful package and it will be very helpful for you to understand it.


Old post answering the wrong question:

Are you looking for disjoint or overlapping pairs?

Here's one solution using the function rollapply from package zoo:

library(zoo)

motif_pairs <- rollapply(dat$motif, 2, c)              # get a matrix of pairs
motif_pairs <- apply(motif_pairs, 1, function(row) {   # for every row...
  paste0(sort(row), collapse = " ")                    #   sort the row, and concatenate it to a single string
                                                       #   (sorting ensures that pairs are not double-counted)
})
table(motif_pairs)                                     # since each pair is now represented by a unique string, just tabulate the string appearances

## if you want disjoint pairs, do `rollapply(dat$motif, 2, c, by = 2)` instead

Take a look at the docs for rollapply if this isn't quite what you need. For grouping by other variables, you can combine this with one of:

  • base R functions aggregate or by (not recommended), or
  • the *ply functions from plyr (better)
shadowtalker
  • 12,529
  • 3
  • 53
  • 96
  • I have grouped my variables into one single id. – Komal Rathi Oct 07 '14 at 21:41
  • What do you mean by disjoint and overlapping pairs? I just want to count the number of times a pair occurs across every group. How do I convert the numbers in the output to motif names? – Komal Rathi Oct 07 '14 at 21:45
  • @KomalRathi if this isn't right, you need to be more clear about what you mean by a "pair". – shadowtalker Oct 07 '14 at 21:50
  • I did not mean to offend you and no way I was implying that you are incorrect. I meant to ask you what are disjoint and overlapping pairs and then I just tried to explain what I need. By pair, I mean each possible pair of motifs. I have also provided an example of what the desired output should look like based on the data that I have shown. In the data that is shown, GATA6 and GATA4 are two motifs. They co-occur in 3 different ids. So the count is 3. Similarly, GATA6 and SRF co-occur in 2 different ids so the count is 2. – Komal Rathi Oct 07 '14 at 21:51
  • No offense taken. I'm assuming you mean *adjacent* pairs, but that's not clear from your post. – shadowtalker Oct 07 '14 at 21:56
  • Given a sequence `(1 2 3 4 5)`, overlapping adjacent pairs would be `(1 2) (2 3) (3 4) (4 5)` whereas disjoint adjacent pairs would be `(1 2) (3 4) (5)`. Which of those are you looking for? – shadowtalker Oct 07 '14 at 21:57
  • No I mean, every possible pair of motif. Not just adjacent pairs. I have shown in the question also. Also if you are unclear, read the big comment that I have written. – Komal Rathi Oct 07 '14 at 21:58
  • Oh, I didn't notice the ID variable. Then this is incredibly easy. I'll edit my post. – shadowtalker Oct 07 '14 at 22:00
  • Grouping by `id` doesn't work either, not unless you do a rollup of all the tables at the end. In OP's example output, the `GATA6` `GATA4` row has count 3 because that is the count of unique `id`s in which `GATA4` and `GATA6` appear together. – Gregor Thomas Oct 07 '14 at 22:56
  • @Gregor I was under the impression OP wanted that table for each `id`. – shadowtalker Oct 08 '14 at 01:06
1

What about this?:

res1<- split(dat$motif,dat$id)
res2<- lapply(res1,function(x) combn(x,2))
res3<- apply(do.call(cbind,res2),2,function(x) paste(x[1],x[2],sep="_"))

table(res3)
DatamineR
  • 10,428
  • 3
  • 25
  • 45