Using dplyr in order to count number of flanking regions with coordonates

Question

Hello I have a dataframe such as

COL1 start end Category
A    30    70  Cat1
A    10    20  Cat2
A    90    300 Cat2  
A    12    26  Cat2
A    72    145 Cat2
B    71    145 Cat2
B    250   350 Cut3
B    355   600 Cat2

So here I'm looking for a code to count the number of df$Category=="Cat1" that have flanking df$Category=="Cat2" values, and this flanking regions must be < 5

So let's take and exemple, for each df$COL1 and each df$Category I count the number of flanking Cat2:

Here

COL1 start end  Category
A    30    70   Cat1

so I'm looking for Cat2 with a start !< 25 and end !> 75, when I look into the df I see that there are :

A    10    20  Cat2       <- this one is too faraway (-10) from 30
A    90    300 Cat2       <- this one is too faraway (+30) from 70
A    72    145 Cat2       <- this one is ok since 72 is just +2 faraway from 70
A    12    26  Cat2       <- this one is ok since 26 is just -4 faraway from 30

So I add a count into a table such as :

New_df

COL1 Nb_flanking
A    2

Then I do the same for df$COL1 ==B :

COL1 start end Category
B    250   350 Cut3

I'm looking for Cat2 with a start !< 245 and end !> 355, when I look into the df I see that there are :

B    71    145 Cat2    <- this one is too faraway (-105) from 250
B    355   600 Cat2    <- this one is ok since 345 is just +5 faraway from 350

Then I fill the New_df

COL1 Nb_flanking
A    2
B    1

and so on and so on...

Here are the data

structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("A", "B"), class = "factor"), start = c(30L, 
10L, 90L, 12L, 72L, 71L, 250L, 355L), end = c(70L, 20L, 300L, 
26L, 145L, 145L, 350L, 600L), Category = structure(c(1L, 2L, 
2L, 2L, 2L, 2L, 3L, 2L), .Label = c("Cat1", "Cat2", "Cut3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-8L))

Thank you very much for your help and time.

Should be straightforward with `data.table` non-equi join or `foverlaps`. See [Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions) — Henrik, Feb 12 '21 at 10:45
But i'm not looking for overlaps here but more flankins regions — chippycentra, Feb 12 '21 at 10:49
Indeed, but you may e.g. subtract 5 from the 'start' and add 5 to 'end', and then check for overlaps / `<` / `>` of the ranges. — Henrik, Feb 12 '21 at 10:56

score 1 · Answer 1 · answered Feb 12 '21 at 11:15

You can do something like this, but using for loop may not be the best solution

library(tidyverse)

df <- structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
                                  2L), .Label = c("A", "B"), class = "factor"), start = c(30L, 
                                                                                          10L, 90L, 12L, 72L, 71L, 250L, 355L), end = c(70L, 20L, 300L, 
                                                                                                                                        26L, 145L, 145L, 350L, 600L), Category = structure(c(1L, 2L, 
                                                                                                                                                                                             2L, 2L, 2L, 2L, 1L, 2L), .Label = c("Cat1", "Cat2"), class = "factor")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                                          -8L)) 
df %>% 
  as_tibble() -> df

df %>% 
  filter(Category == "Cat1") -> df1

df %>% 
  filter(!Category == "Cat1") -> df2


count_function <- function(start, end, df = df2){
  start-5 -> min
  end+5 -> max
  
  n = 0
  
  for(i in 1:nrow(df)){
    df %>% 
      slice(i) -> y
    
    if(y$start > min & y$start < max){
      n <- n + 1
    } else if(y$end > min & y$end < max){
      n <- n + 1
    } 
  }
  
  n
  
  
}

df1 %>% 
  mutate(Nb_flanking_Cat2 = map2_dbl(start, end, count_function))
#> # A tibble: 2 x 5
#>   COL1  start   end Category Nb_flanking_Cat2
#>   <fct> <int> <int> <fct>               <dbl>
#> 1 A        30    70 Cat1                    3
#> 2 B       250   350 Cat1                    1

^{Created on 2021-02-12 by the reprex package (v0.3.0)}

score 1 · Accepted Answer · answered Feb 12 '21 at 15:07

If you don't mind going the data.table way, it can be easily done with a non-equi join.

Some steps can be simplified but I kept the for clarity.

library(data.table)
    dset <- data.table(structure(
        list(
            COL1 = structure(
                c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
                  2L),
                .Label = c("A", "B"),
                class = "factor"
            ),
            start = c(30L,
                      10L, 90L, 12L, 72L, 71L, 250L, 355L),
            end = c(70L, 20L, 300L,
                    26L, 145L, 145L, 350L, 600L),
            Category = structure(
                c(1L, 2L,
                  2L, 2L, 2L, 2L, 3L, 2L),
                .Label = c("Cat1", "Cat2", "Cut3"),
                class = "factor"
            )
        ),
        class = "data.frame",
        row.names = c(NA, -8L)
    ))
    # Split Cat 1 & Cat2
    ds1 <- dset[Category != "Cat2"]
    ds2 <- dset[Category == "Cat2"]
    
    # The distance to flank
    dto_flank <- 5
    ds1[, start := start - dto_flank]
    ds1[, end := end + dto_flank]
    
    # right join between ds1 and ds2
    rj <-
        ds2[ds1, .(x.start, i.start, x.end, i.end, x.COL1, i.COL1), , on = .(COL1 = COL1, start <= end , end >= start)]
    New_df <-
        rj[, .(Nb_flanking = sum(!is.na(x.start))) , .(COL1 = i.COL1)]
    
    New_df
#>    COL1 Nb_flanking
#> 1:    A           2
#> 2:    B           1

^{Created on 2021-02-12 by the reprex package (v0.3.0)}

Using dplyr in order to count number of flanking regions with coordonates

2 Answers2