Sqldf is very slow, how to change this sql code to data.table

Question

I have run r script in R that I have run for about a day using sqldf, now I am trying to use data.table package but still I don't know how to change all this script to work using data.table.

> dput(df[1:4, ])
structure(list(chr = c("chr1", "chr1", "chr1", "chr1"), cpg = c(4222, 
4234, 4235, 4313), count_c = c(0L, 0L, 0L, 2L), total_coverage = c(8L, 
6L, 8L, 8L)), row.names = 12:15, class = "data.frame")
> dput(annotation_with_total_cpgs[1:4, ])
structure(list(gene_id = c("PSOL00004", "PSOL00004", "PSOL00004-TA", 
"PSOL00004-TA"), chr = c("chr5", "chr5", "chr5", "chr5"), start = c(9914646L, 
9914646L, 9914646L, 9914646L), end = c(9917882L, 9917882L, 9914818L, 
9914818L), feature = c("gene", "mRNA", "CDS", "exon"), cpg_count = c(101L, 
101L, 11L, 11L)), row.names = c(NA, 4L), class = "data.frame")

df = read.table("final_coverage.txt", header = T) annotation_with_total_cpgs <- read_table("total_cpgs.txt") 
output <- sqldf("
  SELECT sample.chr, sample.cpg, sample.count_c, 
    sample.total_coverage, annot.chr, annot.start, 
    annot.end, annot.gene_id, annot.cpg_count, annot.feature 
  FROM df AS sample 
    LEFT JOIN annotation_with_total_cpgs AS annot 
      ON sample.chr = annot.chr 
      AND (sample.cpg >= annot.start AND sample.cpg <= annot.end)
") 
output <- output[!is.na(output$gene_id),]

This might help: [Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions) — MrFlick, Mar 10 '22 at 07:20
Trying using `between` instead of two comparisons and perhaps add an index to `cpg`. — G. Grothendieck, Mar 10 '22 at 14:57

score 2 · Answer 1 · answered Mar 10 '22 at 18:50

2

Without sample data, this is a guess, but ...

library(data.table)
setDT(df)
setDT(annotation_with_total_cpgs)
annotation_with_total_cpgs[, c("start0", "end0") := .(start, end)
  ][df[,cpg0 := cpg], on = .(chr, start0 <= cpg0, end0 >= cpg0)
    ][, .(chr, cpg, count_c, total_coverage, start, end, gene_id, cpg_count, feature)]

The reason I create start0, end0, and cpg0 is because data.table-joins tend to overwrite the join variables use in the overlap/non-equi join, and I'm never certain which is which, so I explicitly keep the desired values. If we didn't explicitly subset the columns with .(chr, cpg, ...), then you'd see both start and start0, and would likely want to remove the *0 variables I created.

answered Mar 10 '22 at 18:50

r2evans

141,215
6
77
149

Thanks @r2evans for your answer this the full script df = read.table("final_coverage.txt", header = T) annotation_with_total_cpgs <- read_table("total_cpgs.txt") output <- sqldf("SELECT sample.chr, sample.cpg, sample.count_c, sample.total_coverage, annot.chr, annot.start, annot.end, annot.gene_id, annot.cpg_count, annot.feature FROM df AS sample LEFT JOIN annotation_with_total_cpgs AS annot ON sample.chr = annot.chr AND (sample.cpg >= annot.start AND sample.cpg <= annot.end) ") output <- output[!is.na(output$gene_id),] – Wanees Mar 11 '22 at 16:25
I don't know what I'm supposed to do with that comment, Wanees. Are you saying that my code works? Or are you trying to make your question more reproducible? – r2evans Mar 11 '22 at 16:26
When I run it I got this result and I don't know what is the problem and I am trying to explain but I post incomplete comment chr cpg count_c total_coverage start end gene_id cpg_count feature 1: chr1 260 0 6 NA NA NA 2: chr1 3364 0 6 NA NA NA – Wanees Mar 11 '22 at 16:40
gene_id, cpg_count, feature show after I run your code @r2evans – Wanees Mar 11 '22 at 17:29
1

Well, like I started with, this was speculative. As long as there is no sample data, I don't know that I'm going to dive into why it isn't doing what you want it to do. Please consider making your question *reproducible* including sample data; to do so *best*, please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info, and then use `dput` or similar to provide usable unambiguous, representative data. – r2evans Mar 11 '22 at 18:09
I've added the sample data according to your suggestion, please check – Wanees Mar 12 '22 at 09:57
I see. Having looked at and tested your sample data, the problem is clear: you are trying to join on `start`/`end` from `annotation*`, and the values there are in the range of 9914646+; and join those on `cpg`, whose values are on the order of 4222+. Nothing can be matched, because your conditions of `(sample.cpg >= annot.start AND sample.cpg <= annot.end)` fail to be met 100% of the time. – r2evans Mar 12 '22 at 19:05

score 0 · Answer 2 · answered Mar 12 '22 at 15:38

0

have a look at the foverlaps function from the data.table package

answered Mar 12 '22 at 15:38

V. Lou

159
5

Sqldf is very slow, how to change this sql code to data.table

2 Answers2