0

I am new to data.table, and am trying to switch over. I have 2 data.tables (variable_sites and dt_bam) and want to use variable_sites$POS (call this refPOS) to perform a function using data from dt_bam. To get the variable read_base in the summary table, I want to find a row in dt_bam where refPOS is less than pos + qwidth and extract a character from the string dt_bam$seq based on the difference between refPOS and pos

I have it working for one single value of refPOS but don't really know how to sapply a vector of refPOSs in the data.table syntax. Any help is appreciated.

Here is my code:

dt_bam<-data.table(qname=lst[[1]],rname=lst[[2]],strand=lst[[3]],pos=lst[[4]],qwidth=lst[[5]],cigar=lst[[6]],
                   seq=as.character(lst[[7]]))
refPOS<-1000140 # renamed POS so not to confuse with pos
summ_tab <-  dt_bam[refPOS < pos +qwidth & refPOS >pos,
                    .(locus_pos=refPOS,read_base = substr(seq,abs(refPOS-pos),abs(refPOS-pos)))] 

# sapply(variable_sites[,POS],) then the individual values from variable_sites[POS] become refPOS

expected output, as below but one row for every row in dt1 variable_sites[,POS]:

    refPOS read_base
1: 1000140         C

Here is some sample data:

> head(variable_sites)
    CHR     POS REF
1: chr1 1013855   G
2: chr1 1045080   G
3: chr1 1051873   C
4: chr1 1083795   C
5: chr1 1091327   C
6: chr1 1091421   T    

> head(dt_bam)
                qname rname strand     pos qwidth  cigar
1: SRR709972.27609810  chr1      + 1000135    101   101M
2: SRR709972.27609810  chr1      - 1000145    101   101M
3: SRR709972.23678227  chr1      + 1000545    101 91M10S
4: SRR709972.23678227  chr1      - 1000632    101   101M
5: SRR709972.11643848  chr1      + 1000651    101   101M
6: SRR709972.18299955  chr1      + 1000669    101   101M
                                                                                                     seq
1: GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC
2: GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG
3: CGAGCCTCGGTCTCGAGCCTCTTGGCTTCCTCCGCCCTTCCCCACTCCGGTCCCGGTTTGGGCCCTGCTCTGTCTCCGAGTTTGATCCGACCCCGCCTCGC
4: CGACACCGGCTCGGCCTCCGGGGGTCCCCCCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTG
5: GGGGGTCCCACCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCGGCCGGGTCGGCAGGCG
6: TGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCTGCCGGGTCGGCAGGCGGGAGGGCGGAGTCAGCGG

> dput(head(variable_sites))
setDT(structure(list(CHR = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1"), POS = c(1013855L, 1045080L, 1051873L, 1083795L, 1091327L, 
1091421L), REF = c("G", "G", "C", "C", "C", "T")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame")))
ekoam
  • 8,744
  • 1
  • 9
  • 22
user2814482
  • 631
  • 1
  • 10
  • 28
  • 1
    It's not really clear to me what you're aiming for. Where is `read_base` being used? What are the other columns in `variable_sites` being used? Can you post the `dput` of `dt_bam` as well? What is the expected output? – r2evans Jan 20 '22 at 05:51
  • It seems to me that you want to extract the single character at the position `refPOS - pos` if `pos < refPOS < pos + width` and call it `read_base`. However, you also mentioned that this needs `sapply`, meaning that for each `variable_stes$POS`, you want to do this procedure for **all** values in `dt_ban$pos`? This will create a lot of new columns I suppose and doesn't seem right to me. – ekoam Jan 20 '22 at 05:59
  • @ekoam you are correct in what I want to do. I was thinking to use sapply so that this can be done for every variable_stes$POS. I was hoping to create new rows for summ_tab, not columns ? – user2814482 Jan 20 '22 at 06:10
  • I see. You want to do [_non-equi join_](https://stackoverflow.com/q/41043047/10802499). @user2814482 – ekoam Jan 20 '22 at 06:14
  • yes, I was thinking that it was some sort of conditional merge or join, however, I don't have any common columns. POS in dt1 just has to be between a range in pos pos+width in dt2 – user2814482 Jan 20 '22 at 06:22

1 Answers1

3

This is the data.table approach you are looking for. We create a temporary variable end in dt_bam and then perform a non-equi join. Note that when performing the join, you MUST use x.POS to refer to variable_sites$POS. POS will give you the wrong variable. i.pos/pos/POS all refer to dt_bam$pos, as by default the variable you are joining on (POS in this case) is replaced by the first corresponding variable (pos in this case) in the data.table joined with.

library(data.table)

variable_sites[
  dt_bam[, end:=pos+qwidth], read_base:=substr(seq, x.POS - i.pos, x.POS - i.pos), 
  on = .(POS > pos, POS < end)
]
dt_bam[, end:=NULL]

Output

> variable_sites
    CHR     POS REF read_base
1: chr1 1013855   G      <NA>
2: chr1 1045080   G      <NA>
3: chr1 1051873   C      <NA>
4: chr1 1083795   C      <NA>
5: chr1 1091327   C      <NA>
6: chr1 1091421   T      <NA>
7: chr1 1000140   ?         C

Data

variable_sites <- data.table::setDT(structure(list(CHR = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1"), POS = c(1013855L, 1045080L, 1051873L, 1083795L, 
1091327L, 1091421L, 1000140L), REF = c("G", "G", "C", "C", "C", 
"T", "?")), row.names = c(NA, -7L), class = c("data.table", "data.frame")))

dt_bam <- data.table::setDT(structure(list(qname = c("SRR709972.27609810", "SRR709972.27609810", 
"SRR709972.23678227", "SRR709972.23678227", "SRR709972.11643848", 
"SRR709972.18299955"), rname = c("chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1"), strand = c("+", "-", "+", "-", "+", "+"), pos = c(1000135L, 
1000145L, 1000545L, 1000632L, 1000651L, 1000669L), qwidth = c(101L, 
101L, 101L, 101L, 101L, 101L), cigar = c("101M", "101M", "91M10S", 
"101M", "101M", "101M"), seq = c("GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC", 
"GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG", 
"CGAGCCTCGGTCTCGAGCCTCTTGGCTTCCTCCGCCCTTCCCCACTCCGGTCCCGGTTTGGGCCCTGCTCTGTCTCCGAGTTTGATCCGACCCCGCCTCGC", 
"CGACACCGGCTCGGCCTCCGGGGGTCCCCCCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTG", 
"GGGGGTCCCACCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCGGCCGGGTCGGCAGGCG", 
"TGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCTGCCGGGTCGGCAGGCGGGAGGGCGGAGTCAGCGG"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))

These joins have implicit copying involved.

summ_tab <- variable_sites[dt_bam[, end:=pos+qwidth], .(
  refPOS = x.POS, 
  read_base = substr(seq, x.POS - i.pos, x.POS - i.pos)
), on = .(POS > pos, POS < end), nomatch=NULL]

or

summ_tab <- dt_bam[, end:=pos+qwidth][variable_sites, .(
  refPOS = i.POS, 
  read_base = substr(seq, i.POS - x.pos, i.POS - x.pos)
), on = .(pos < POS, end > POS), nomatch=NULL]

nomatch=NULL drops all rows that cannot be matched against in variable_sites. Removing this switch then the two methods above give you different behaviors. Choose the one you want.


The extended solution to your problem based on our discussion

dt_bam[, c("start", "end") := .(
  pos - qwidth * (strand == "-"), 
  pos + qwidth * (strand == "+")
)]

variable_sites[dt_bam, .(
  refPOS = x.POS, 
  read_base = substr(seq, x.POS - i.start, x.POS - i.start)
), on = .(POS > start, POS < end), nomatch=NULL]

dt_bam[, c("start", "end") := NULL]
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • actually the ":= notation modifies the table variable sites. I want to keep that original table and make a new table called summ_tab. I'm trying to edit the syntax to achieve this but keep getting errors. Sorry new to data.table. – user2814482 Jan 20 '22 at 14:46
  • Then change the first line from `variable_sites[` to `summ_tab <- copy(variable_sites)[`. @user2814482 – ekoam Jan 20 '22 at 14:52
  • thanks, was trying to avoid copy. Also I really want a right outer join as there should be multiple rows for which x.POS is >pos & < end. I thought this was like the x[y] notation but the results are the same number of rows as variable_sites – user2814482 Jan 20 '22 at 15:05
  • @user2814482 Then you cannot avoid copying. The one above avoids copying but that means you need to do the in-place modification. If you want a completely new `summ_tab` and grow the rows, the join will involve copying the data implicitly even if no `copy` is called. – ekoam Jan 20 '22 at 15:12
  • now I"m confused about how to specify the right outer join. this isn't working . sum_tab_small<-copy(dt_bam)[variable_sites[dt_bam[, end:=pos+qwidth], read_base:=substr(seq, x.POS - i.pos, x.POS - i.pos), on = .(POS > pos, POS < end)]] – user2814482 Jan 20 '22 at 15:30
  • I will give you the answer you want. Just take note that copy is implicitly involved. Check the update. @user2814482 – ekoam Jan 20 '22 at 15:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241255/discussion-between-user2814482-and-ekoam). – user2814482 Jan 20 '22 at 16:15
  • Hi @user2814482, check the update if you still need the solution. – ekoam Jan 20 '22 at 19:37
  • update looks great. I am unclear how the * is used by data.table though. Could you explain that. thanks – user2814482 Jan 21 '22 at 04:10
  • That's just multiplication. For example, for the first one, `strand == "-"` gives `TRUE (1)` if `"-"` and `FALSE (0)` if `"+"`. As per your conditions, the beginning position is `pos - qwidth * 0` when `strand = "+"` and `pos - qwidth * 1` when `strand = "-"`, isn't it? @user2814482 – ekoam Jan 21 '22 at 05:02