-2

I have run r script in R that I have run for about a day using sqldf, now I am trying to use data.table package but still I don't know how to change all this script to work using data.table.

> dput(df[1:4, ])
structure(list(chr = c("chr1", "chr1", "chr1", "chr1"), cpg = c(4222, 
4234, 4235, 4313), count_c = c(0L, 0L, 0L, 2L), total_coverage = c(8L, 
6L, 8L, 8L)), row.names = 12:15, class = "data.frame")
> dput(annotation_with_total_cpgs[1:4, ])
structure(list(gene_id = c("PSOL00004", "PSOL00004", "PSOL00004-TA", 
"PSOL00004-TA"), chr = c("chr5", "chr5", "chr5", "chr5"), start = c(9914646L, 
9914646L, 9914646L, 9914646L), end = c(9917882L, 9917882L, 9914818L, 
9914818L), feature = c("gene", "mRNA", "CDS", "exon"), cpg_count = c(101L, 
101L, 11L, 11L)), row.names = c(NA, 4L), class = "data.frame")

df = read.table("final_coverage.txt", header = T) annotation_with_total_cpgs <- read_table("total_cpgs.txt") 
output <- sqldf("
  SELECT sample.chr, sample.cpg, sample.count_c, 
    sample.total_coverage, annot.chr, annot.start, 
    annot.end, annot.gene_id, annot.cpg_count, annot.feature 
  FROM df AS sample 
    LEFT JOIN annotation_with_total_cpgs AS annot 
      ON sample.chr = annot.chr 
      AND (sample.cpg >= annot.start AND sample.cpg <= annot.end)
") 
output <- output[!is.na(output$gene_id),]
r2evans
  • 141,215
  • 6
  • 77
  • 149
Wanees
  • 1
  • 1

2 Answers2

2

Without sample data, this is a guess, but ...

library(data.table)
setDT(df)
setDT(annotation_with_total_cpgs)
annotation_with_total_cpgs[, c("start0", "end0") := .(start, end)
  ][df[,cpg0 := cpg], on = .(chr, start0 <= cpg0, end0 >= cpg0)
    ][, .(chr, cpg, count_c, total_coverage, start, end, gene_id, cpg_count, feature)]

The reason I create start0, end0, and cpg0 is because data.table-joins tend to overwrite the join variables use in the overlap/non-equi join, and I'm never certain which is which, so I explicitly keep the desired values. If we didn't explicitly subset the columns with .(chr, cpg, ...), then you'd see both start and start0, and would likely want to remove the *0 variables I created.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks @r2evans for your answer this the full script df = read.table("final_coverage.txt", header = T) annotation_with_total_cpgs <- read_table("total_cpgs.txt") output <- sqldf("SELECT sample.chr, sample.cpg, sample.count_c, sample.total_coverage, annot.chr, annot.start, annot.end, annot.gene_id, annot.cpg_count, annot.feature FROM df AS sample LEFT JOIN annotation_with_total_cpgs AS annot ON sample.chr = annot.chr AND (sample.cpg >= annot.start AND sample.cpg <= annot.end) ") output <- output[!is.na(output$gene_id),] – Wanees Mar 11 '22 at 16:25
  • I don't know what I'm supposed to do with that comment, Wanees. Are you saying that my code works? Or are you trying to make your question more reproducible? – r2evans Mar 11 '22 at 16:26
  • When I run it I got this result and I don't know what is the problem and I am trying to explain but I post incomplete comment chr cpg count_c total_coverage start end gene_id cpg_count feature 1: chr1 260 0 6 NA NA NA 2: chr1 3364 0 6 NA NA NA – Wanees Mar 11 '22 at 16:40
  • gene_id, cpg_count, feature show after I run your code @r2evans – Wanees Mar 11 '22 at 17:29
  • 1
    Well, like I started with, this was speculative. As long as there is no sample data, I don't know that I'm going to dive into why it isn't doing what you want it to do. Please consider making your question *reproducible* including sample data; to do so *best*, please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info, and then use `dput` or similar to provide usable unambiguous, representative data. – r2evans Mar 11 '22 at 18:09
  • I've added the sample data according to your suggestion, please check – Wanees Mar 12 '22 at 09:57
  • I see. Having looked at and tested your sample data, the problem is clear: you are trying to join on `start`/`end` from `annotation*`, and the values there are in the range of 9914646+; and join those on `cpg`, whose values are on the order of 4222+. Nothing can be matched, because your conditions of `(sample.cpg >= annot.start AND sample.cpg <= annot.end)` fail to be met 100% of the time. – r2evans Mar 12 '22 at 19:05
0

have a look at the foverlaps function from the data.table package

V. Lou
  • 159
  • 5