I have derived all the start and stop positions within a DNA string, each stored as a vector. I would now like to map each start position to a stop position and then use these pairs to extract the corresponding substrings from the DNA sequence. However, I am unable to loop through both vectors efficiently to achieve this, especially as they are not the same length.

I have tried different versions of loops (for, ifelse), but I have not been able to wrap my head around a solution yet.

Here is an example of one of my several attempts at solving this problem.

new = data.frame()
for (i in start_pos){
  for (j in stop_pos){
    while (j>i){
      new[j,1]=i
      new[j,2]=j
    }
  }
}

Here is an example of my desired result, given start = c(1, 5, 7, 9, 15) and stop = c(4, 13, 20, 30, 40, 50). Ideally, the result would be a data frame of two columns mapping each start to its stop position. I only want to add rows to the data frame where the start value is less than its corresponding stop value (multiple start values can share the same stop value as long as they fulfil this criterion), as shown in my example below.

 i.e. first row of df  = (1, 4)
      second row of df = (5, 13)
      third row of df  = (7, 13)
      fourth row of df = (9, 13)
      fifth row of df  = (15, 20)
B_bunny
  • Since you have more stops than starts in your example, should the extra stops just be ignored and not present in the data at all? – Marius Apr 02 '19 at 04:46
  • What happens to those with no match? – NelsonGon Apr 02 '19 at 04:48
  • Hello Marius. Ideally, I would like the extra stops to be ignored. However, I am hesitant to do this, since I am writing this script so that it can work efficiently on any DNA string presented. – B_bunny Apr 02 '19 at 04:49
  • Can there be more starts than stops? What happens in those cases? Is the stop shared? – Ronak Shah Apr 02 '19 at 04:50
  • @ronak, yes, there can be more starts than stops. However, I only want to match each start value with the next stop value greater than the start. Yes, stops can be shared as long as both starts are less than the shared stop value. I hope this makes sense. – B_bunny Apr 02 '19 at 05:17
  • This looks like a fuzzy match. Why should 9 be matched with 13?! – NelsonGon Apr 02 '19 at 05:42
  • I'm actually going to suggest this is essentially a duplicate of - https://stackoverflow.com/questions/20133344/find-closest-value-in-a-vector-with-binary-search – thelatemail Apr 02 '19 at 05:49
  • @Nelson, I understand how this can seem a bit fuzzy. My script, however, is also based on biology. As mentioned in my question, I derived these positions from start and stop positions within a DNA sequence. Each start value is allowed to move along the DNA sequence until it is terminated by a stop value. – B_bunny Apr 02 '19 at 05:53

2 Answers

1

Here is a possible tidyverse solution:

library(purrr)
library(plyr)
library(dplyr)

map2 iterates over the two vectors (start and stop) in parallel, calling the supplied function on each pair of elements. Each pair is combined into one vector, and the resulting list is then unlisted and combined into a data.frame object.
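To illustrate the element-wise pairing described above, here is a minimal sketch in base R. It uses Map as a stand-in for purrr::map2 (an assumption made to keep the example free of dependencies; the two are similar for this purpose but not identical), and the names start_pos, stop_pos, and paired_df are illustrative only.

```r
# Illustrative vectors of equal length (values taken from the thread's example)
start_pos <- c(1, 5, 15)
stop_pos  <- c(4, 13, 20)

# The function is called once per index:
# f(start_pos[1], stop_pos[1]), f(start_pos[2], stop_pos[2]), ...
pairs <- Map(function(x, y) c(start = x, stop = y), start_pos, stop_pos)

# Bind the list of length-2 vectors into a two-column data frame
paired_df <- as.data.frame(do.call(rbind, pairs))
print(paired_df)
```

Note that pairing by index only matches the i-th start with the i-th stop; it does not search for the next stop greater than each start.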

EDIT: With the updated condition, we can do something like:

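start1 = c(118, 220, 255)
stop1 = c(115, 210, 260)
# Pair elements by index; keep the stop only when it is greater than the start
res <- purrr::map2(start1[1:length(stop1)], stop1, function(x, y) c(x, y[y > x]))
# Keep only the pairs that still contain a stop
res[unlist(lapply(res, function(x) length(x) > 1))]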
   # [[1]]
   # [1] 255 260

ORIGINAL (the output below corresponds to start = c(1, 5, 15) and stop = c(4, 13, 20, 30, 40, 50)):

plyr::ldply(purrr::map2(start[1:length(stop)],stop,function(x,y) c(x,y)),unlist) %>% 
   setNames(nm=c("start","stop")) %>% 
 mutate(newCol=paste0("(",start,",",stop,")"))
#  start stop  newCol
#1     1    4   (1,4)
#2     5   13  (5,13)
#3    15   20 (15,20)
#4    NA   30 (NA,30)
#5    NA   40 (NA,40)
#6    NA   50 (NA,50)

Alternative: A clever way is shown by @Marius. The key is to have vectors of corresponding lengths.

plyr::ldply(purrr::map2(start,stop[1:length(start)],function(x,y) c(x,y)),unlist) %>% 
   setNames(nm=c("start","stop")) %>% 
 mutate(newCol=paste0("(",start,",",stop,")"))
#  start stop  newCol
#1     1    4   (1,4)
#2     5   13  (5,13)
#3    15   20 (15,20)
NelsonGon
  • Hello Nelson, thank you for your response. Would it be possible to get a brief explanation of what purrr::map2 is doing? Also, this only worked for the first 4 rows of my data, after which the stop positions were less than their corresponding start positions. In other words, I only want to map each start to the next stop greater than the start. – B_bunny Apr 02 '19 at 05:07
  • Added an explanation. Could you elaborate "after which the stop positions were less than its corresponding start positions" further? – NelsonGon Apr 02 '19 at 05:12
  • @B_bunny I have edited to add an alternative. It's important to have the same lengths. – NelsonGon Apr 02 '19 at 05:18
  • Yes, sure. For example: start = c(118, 220, 255), stop = c(115, 210, 260). The only valid row to be added to my df is (255, 260). – B_bunny Apr 02 '19 at 05:22
  • This is different. You should edit your question to add this new condition, that is, only add them if the start is less than the stop. – NelsonGon Apr 02 '19 at 05:24
  • Okay, will do that. Thank you. – B_bunny Apr 02 '19 at 05:28
1

Here's a fairly simple solution - it's probably good not to over-complicate things unless you're sure you need the extra complexity. The starts and stops already seem to be matched up, you just might have more of one than the other, so you can find the length of the shortest vector and only use that many items from start and stop:

start = c(1, 5, 15) 
stop = c(4, 13, 20, 30, 40, 50)

min_length = min(length(start), length(stop))

df = data.frame(
    start = start[1:min_length],
    stop = stop[1:min_length]
)

EDIT: after reading some of your comments here, it looks like your problem actually is more complicated than it first seemed (coming up with examples that demonstrate the level of complexity you need, without being overly complex, is always tricky). If you want to match each start with the next stop that's greater than the start, you can do:

# Slightly modified example: multiple starts
#   that can be matched with one stop
start = c(1, 5, 8)
stop = c(4, 13, 20, 30, 40, 50)

df2 = data.frame(
    start = start,
    stop = sapply(start, function(s) { min(stop[stop > s]) })
)
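One caveat with the snippet above: if a start is greater than every stop, stop[stop > s] is empty and min() returns Inf with a warning. The following is a hedged sketch of one way to handle that case, using the question's example plus an extra start (60) added here purely for illustration. The helper next_stop and the names start_pos, stop_pos, and matched are hypothetical, not from the original posts.

```r
start_pos <- c(1, 5, 7, 9, 15, 60)   # 60 is a made-up start with no later stop
stop_pos  <- c(4, 13, 20, 30, 40, 50)

# next_stop() is a hypothetical helper: the smallest stop greater than s,
# or NA when no such stop exists
next_stop <- function(s, stops) {
  later <- stops[stops > s]
  if (length(later) == 0) NA_real_ else min(later)
}

matched <- data.frame(
  start = start_pos,
  stop  = vapply(start_pos, next_stop, numeric(1), stops = stop_pos)
)

# Drop starts that never reach a stop
matched <- matched[!is.na(matched$stop), ]
print(matched)
```

Whether unmatched starts should be dropped or kept as NA rows depends on the analysis; dropping them is only one option.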
Marius
  • This is also `findInterval` - `stop[findInterval(start, stop) + 1]` , as borrowed from https://stackoverflow.com/questions/20133344/find-closest-value-in-a-vector-with-binary-search and which should be fast. – thelatemail Apr 02 '19 at 05:44
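As a rough sketch of how the findInterval approach could feed the substring extraction mentioned in the question: the DNA string and positions below are made up for illustration, the names dna, start_pos, stop_pos, and orfs are hypothetical, and the code assumes that stop_pos is sorted in increasing order and that each stop position marks the first base of a three-base stop codon.

```r
dna <- "ATGAAATAGCCATGCCCTGA"   # made-up sequence, for illustration only
start_pos <- c(1, 12)           # positions of "ATG" in dna
stop_pos  <- c(7, 18)           # positions of "TAG" and "TGA"; must be sorted

# findInterval() counts, for each start, how many stops are <= that start;
# adding 1 therefore indexes the first stop strictly greater than the start
idx <- findInterval(start_pos, stop_pos) + 1

orfs <- data.frame(start = start_pos, stop = stop_pos[idx])

# Starts beyond the last stop index past the end of stop_pos and become NA
orfs <- orfs[!is.na(orfs$stop), ]

# Assumption: add 2 so that the extracted substring includes the stop codon
orfs$seq <- substring(dna, orfs$start, orfs$stop + 2)
print(orfs)
```

If the stop positions instead mark the last base of the codon, the + 2 offset should be dropped.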