3

I have 2 dataframes:

> access
     V1     V2     V3
1 chr10 136122 136533
2 chr10 179432 179769
3 chr10 182988 183371
4 chr10 224234 224489
5 chr10 237693 237958

and

> peaks
     V1     V2     V3
1 chr10 126122 126533
2 chr10 179450 179730
3 chr10 182788 183350
4 chr10 224244 224500
5 chr10 237695 237950

The coloumn V2 and V3 are start and end of regions (range) in both dataframes. I want to keep those rows in peaks dataframe for which access$V1 == peaks$V1 AND which fall in the range (or regions) of access dataframe. For example the new dataframe will be like: peaks dataframe's

  • 1st row region doesn't exist in access dataframe so it will be assigned category U.

  • 2nd row of peaks falls in the given range in access dataframe (2nd row) and will be assigned category B.

  • 3rd row of peaks doesn't completely fall in that region but it somehow overlaps with region in 3rd row of access, for this I will assign category A.

  • 4th row of peaks also doesn't overlap completely at it ends 11 number after the end of region in row 4 of access, this will also be in category A.

  • 5th row falls in the region hence will be in category B.

Expected output:

> newdf   
     V1     V2     V3 V4
1 chr10 126122 126533  U
2 chr10 179450 179730  B
3 chr10 182788 183350  A
4 chr10 224244 224500  A
5 chr10 237695 237950  B

Here are the dput of input dataframes:

> dput(peaks)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"), 
    V2 = c(126122L, 179450L, 182788L, 224244L, 237695L), V3 = c(126533L, 
    179730L, 183350L, 224500L, 237950L)), .Names = c("V1", "V2", 
"V3"), class = "data.frame", row.names = c(NA, -5L))

> dput(access)
    structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"), 
        V2 = c(136122L, 179432L, 182988L, 224234L, 237693L), V3 = c(136533L, 
        179769L, 183371L, 224489L, 237958L)), .Names = c("V1", "V2", 
    "V3"), class = "data.frame", row.names = c(NA, -5L))

Edit:

My new access df looks like this and now I also want to append the last column in my final output df:

> access
     V1     V2     V3  V4
1 chr10 136122 136533  found
2 chr10 179432 179769  notFound
3 chr10 182988 183371  found
4 chr10 224234 224489  found
5 chr10 237693 237958  notFound

So now there is one extra condition which is if row in access falls in peaks range then also append the value in V4 in a new column in final df, if some region is not found then by default will be notFound. Therefore, final output will be:

> newdf   
     V1     V2     V3 V4 V5
1 chr10 126122 126533  U notFound
2 chr10 179450 179730  B notFound
3 chr10 182788 183350  A found
4 chr10 224244 224500  A found
5 chr10 237695 237950  B notFound

Here in row1$V5 the value is notFound because this region was not found and in remaining cases we got the values in V5 from modified access df.

Newbie
  • 411
  • 5
  • 18
  • 1
    Possible duplicate of [roll join with start/end window](http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window) – zx8754 Jul 18 '16 at 09:30
  • @zx8754 The difference is in that question it is just one number being checked if it exist in the range or not, **here I am searching for a range (not a number) with in an other range**. – Newbie Jul 18 '16 at 09:47
  • @zx8754 `newdf <- subset(merge(access, peaks), start <= V2 & V3 <= end)` this is giving me only those which falls in those range, I am unable to figure out how to assign them the categories I defined and also how to keep those which do not completely fall in those ranges. Can you help me with this..? – Newbie Jul 18 '16 at 09:56
  • 1
    This is exactly the case `foverlaps` was developed - with genetics regions overlap in mind. Read about `type` and `nomatch` arguments of the function. – zx8754 Jul 18 '16 at 10:02
  • @zx8754 I have tried but I am unable to do what I want. When I use `type= "all"`, I get rows which are within those ranges and also which overlaps but not exactly within that range, and when I use `type = "within"`, I only get rows which falls within given ranges. So I know I have to use `type= "all"` but I am not getting the idea how I can label those rows as per the categories I defined. Kindly help me in this context. Thank you. – Newbie Jul 18 '16 at 12:44

4 Answers4

1

If speed is a concern, the linked data.table solution will probably be superior, but it can also be achieved in dplyr, but likely much slower:

library(dplyr)
names(access)[2:3] <- c('start', 'end')

bind_cols(peaks[-1], access) %>%
  rowwise() %>% 
  mutate(V4 = if_else(all(V2:V3 %in% start:end), 'B',
                      if_else(any(V2:V3 %in% start:end), 'A',
                              'U')))

Result:

Source: local data frame [5 x 6]
Groups: <by row>

# A tibble: 5 x 6
      V2     V3     V1  start    end    V4
   <int>  <int> <fctr>  <int>  <int> <chr>
1 126122 126533  chr10 136122 136533     U
2 179450 179730  chr10 179432 179769     B
3 182788 183350  chr10 182988 183371     A
4 224244 224500  chr10 224234 224489     A
5 237695 237950  chr10 237693 237958     B
Axeman
  • 32,068
  • 8
  • 81
  • 94
  • which packages are you using here because I am getting following error: `Error in eval(expr, envir, enclos) : could not find function "if_else"` and if I use `ifelse` instead from base then I get multiple warning and the categories for all rows are **U** – Newbie Jul 18 '16 at 10:23
  • `dplyr` version `0.5.0`. `if_else` is not in earlier versions. – Axeman Jul 18 '16 at 10:24
  • I have got an error when I applied it to my real data where no.of rows are different in 2 dataframes: `Error in eval(expr, envir, enclos) : incompatible number of rows (103335, expecting 90)` Can you guide me how I can solve this issue. – Newbie Jul 18 '16 at 11:01
  • Use left_join instead of bind_cols. – Axeman Jul 18 '16 at 11:02
  • Can you please help me with this [question](http://stackoverflow.com/questions/38501840/dataframe-processing) – Newbie Jul 25 '16 at 07:38
1

Though its a long way of doing it. But it gives the desired results.

library(dplyr)
df<-cbind(peaks,access) #merging both df
colnames(df)<-c("pV1","pV2","pV3","aV1","aV2","aV3")
df<-df[c(which(df$pV1==df$aV1)),] # selecting rows with pV1=aV1
# creating U, A, B
U1<-df%>%
   filter(pV2<aV2 & pV3<aV2)%>%
   mutate(V4="U")
U2<-df%>%
  filter(pV2>aV3 & pV3>aV3)%>%
  mutate(V4="U")
B<-df%>%
 filter(pV2>aV2 & pV3<aV3)%>%
   mutate(V4="B")
A1<-df%>%
   filter(pV2>aV2 & pV3>aV3)%>%
   mutate(V4="A")
A2<-df%>%
   filter(pV2<aV2 & pV3<aV3 & pV3>aV2)%>%
   mutate(V4="A")
#merging U, A and B into newdf
newdf<-arrange(rbind(U1,U2,B,A1,A2),pV2)
newdf<-newdf[,-c(4:6)]
newdf
    pV1    pV2    pV3 V4
1 chr10 126122 126533  U
2 chr10 179450 179730  B
3 chr10 182788 183350  A
4 chr10 224244 224500  A
5 chr10 237695 237950  B
rar
  • 894
  • 1
  • 9
  • 24
1

Using the foverlaps function it can be done with the following statement:

setkey(setDT(access),V1,V2,V3)
setkey(setDT(peaks),V1,V2,V3)

access[,V4:= ifelse(!is.na(foverlaps(peaks, access, type="within", which=TRUE)$yid),"B",ifelse(!is.na( foverlaps(peaks, access, type="any", which=TRUE)$yid),"A","U"))]

The way it functions is the following:

  1. I first use the "within" type to determine if there is an exact overlap (so if a range is included in any of the matching access ranges.If it is the case then "B"
  2. If this is not the case, I use the "any" to determine if we have any overlap which practically identifies the values that have a partial overlap since the values with exact overlap were excluded in the previous step. These values get an "A"
  3. The rest gets a "U"
  • Can you please help me with this [question](http://stackoverflow.com/questions/38501840/dataframe-processing) – Newbie Jul 25 '16 at 07:38
0

Here's another (straightforward) solution using the non-equi joins implemented recently and available in the current development version of data.table, v1.9.7. See installation instructions here:

require(data.table) # v1.9.7+
setDT(access)
setDT(peaks)[, V4 := "U"]                              # no overlap
peaks[access, V4 := "A", on=.(V1, V2 <= V3, V3 >= V2)] # any overlap
peaks[access, V4 := "B", on=.(V1, V2 >= V2, V3 <= V3)] # completly within
#       V1     V2     V3 V4
# 1: chr10 126122 126533  U
# 2: chr10 179450 179730  B
# 3: chr10 182788 183350  A
# 4: chr10 224244 224500  A
# 5: chr10 237695 237950  B

Add a new column to peaks which is all "U". Then replace those rows where there's any kind of overlap with "A". That would contain all rows which are also completely "within". Then once again, perform a conditional join, but this time only for completely within, and replace with "B".


Note that the foverlaps() solution would work just fine as well (it also comes from data.table package). But the new non-equi joins fits well with the [.data.table syntax which allows to aggregate/add/update cols while joining.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Kindly confirm if I am not wrong, they are like: `peaks[access, V4 := "A", on=.(peaks$V1, peaks$V2 <= access$V3, peaks$V3 >= access$V2)]` AND `peaks[access, V4 := "B", on=.(peaks$V1, peaks$V2 >= access$V2, peaks$V3 <= access$V3)]` – Newbie Jul 19 '16 at 08:40
  • You got it right. LHS of expression refers to DT on the outside. – Arun Jul 19 '16 at 09:02
  • I have modified the question, see an Edit above, can you please suggest me how can I solve this new problem? Thanks. – Newbie Jul 25 '16 at 11:15
  • I tried this but it is not working: `peaks[access, V5 := access$V4, on=.(peaks$V1, peaks$V2 <= access$V3, peaks$V3 >= access$V2)]` – Newbie Jul 25 '16 at 11:39
  • Replace `access$V4` with `i.V4` (i refers to the first argument, here `access`). Remove all the `...$` in the `on=` argument. Variables can be referred directly by names. `on` argument doesn't understand what `...$` is. [Read the vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started) to get an idea about data.table's features / philosophy. – Arun Jul 25 '16 at 12:05
  • In the `on` argument, you will be able to do `x.V1 <= i.V1` etc.. where `x` refers the data.table on the outside, and `i` is the first argument, but it's not implemented yet. This is a new feature. – Arun Jul 25 '16 at 12:05
  • In your answer can you modify how can I select those bins which are completely overlapping with my regions (not within, but completely overlapping as they are bigger). For example my bin is from 200-400 and overlapping region is 150-450. So how can I assign them "B"? – Newbie Sep 27 '16 at 09:04
  • Can you please suggest me a solution for this problem? Thanks. https://stackoverflow.com/questions/48187739/r-processing-the-input-file-based-on-range-overlap – Newbie Jan 10 '18 at 13:07