1

I have two R data frames, each have a group (one of 25 strings) and a position column (an integer). I would like to check for each entry of data frame A whether there is an entry in data frame B which is part of the same group and the delta between its position integer is smaller than 500. If, so I want to tick that entry in data frame A.

So for example the first entry in A would match the third entry in B (The group is the same and the difference between their positions is smaller than 500 bp). Therefore, it got marked in the output table.

head(A)
  group     pos
1  chr1 3202965
2  chr1 3000168
3  chr1 3000204
4  chr2 3000560
5  chr2 3000674
6  chr3 3000698

head(B)    
  group     pos
1  chr1 3180137
2  chr1 3200918
3  chr1 3202983
4  chr1 3309167
5  chr4 3458278
6  chr1 4249136

A_out <- magic(A,B)

head(A_out)
  group     pos out
1  chr1 3202965   +
2  chr1 3000168   -
3  chr1 3000204   -
4  chr2 3000560   -
5  chr2 3000674   -
6  chr3 3000698   -

My intuition would be a nested loop (first A then B) and check for each entry combination whether it matches. But my data frames a rather big (12052773 and 44459 entries respectively) and this would never finish.

Is there a smarter approach to handle this?

kEks
  • 304
  • 1
  • 9
  • You are essentially attempting to do a non-equi join. See the `data.table` pkg for an efficient implementation. – mbiron Apr 30 '18 at 17:52
  • I think just one loop for each row of A, and some conditions on B using `data.table` would do the trick. For row `i`in A, something like `is.empty(B[group==a_gp & abs(pos-a_pos)<500 ])`... – Rohit Apr 30 '18 at 17:56
  • Personally, I would strongly advice against doing loops, because of their inefficiency in R. Check [this related question in SO](https://stackoverflow.com/q/37289405/5443023) where you can see how your question can be stated in terms of a join. You can see that there are other approaches besides using `data.table`, based on the `dplyr` and `fuzzyjoin` packages. – mbiron Apr 30 '18 at 18:09

3 Answers3

0

Here is one solution using dplyr package:

df1 <- read.table(text = "  group     pos
1  chr1 3202965
2  chr1 3000168
3  chr1 3000204
4  chr2 3000560
5  chr2 3000674
6  chr3 3000698", header = TRUE)

df2 <- read.table(text = "  group     pos
1  chr1 3180137
2  chr1 3200918
3  chr1 3202983
4  chr1 3309167
5  chr4 3458278
6  chr1 4249136", header = TRUE)

df3 <- left_join(df1, df2, by = 'group')

df3 <- df3 %>%
  mutate(out = ifelse(abs(pos.x - pos.y) <= 500 & !is.na(pos.y), '+', '-')) %>%
  select(group, pos = pos.x, out) %>%
  arrange(group, pos, desc(out)) %>%
  group_by(group, pos) %>%
  slice(1)

# A tibble: 6 x 3
# Groups: group, pos [6]
  group     pos out  
  <chr>   <int> <chr>
1 chr1  3000168 -    
2 chr1  3000204 -    
3 chr1  3202965 +    
4 chr2  3000560 -    
5 chr2  3000674 -    
6 chr3  3000698 -    
Gopala
  • 10,363
  • 7
  • 45
  • 77
0

sqldf comes very handy in such cases. One implementation can be as:

library(sqldf)

df_result <- sqldf("select A.*, B.pos as posB from A left outer join B
      on A.group1 = B.group1 AND
      abs(A.pos - B.pos) <= 500")

#The result is available. Any transformation can be performed on the result as
df_result$out <- ifelse(is.na(df_result$posB), "-" , 
             ifelse(df_result$posB > df_result$pos,"+", "-"))
df_result
#   group1     pos    posB out
# 1   chr1 3202965 3202983   +
# 2   chr1 3000168      NA   -
# 3   chr1 3000204      NA   -
# 4   chr2 3000560      NA   -
# 5   chr2 3000674      NA   -
# 6   chr3 3000698      NA   -

Data: Please note that I have changed group column as group1 to make it compatible with sqldf.

A <- read.table(text = 
"group1     pos
1  chr1 3202965
2  chr1 3000168
3  chr1 3000204
4  chr2 3000560
5  chr2 3000674
6  chr3 3000698",
header = TRUE, stringsAsFactors = FALSE)

B <- read.table(text = 
"group1     pos
1  chr1 3180137
2  chr1 3200918
3  chr1 3202983
4  chr1 3309167
5  chr4 3458278
6  chr1 4249136",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • @kEks Good to know thats what you wanted. You can accept one of the answer by clicking on `tick` in left of answer box. – MKR May 03 '18 at 15:16
0

A transform and an ifelse condition to subset the dataframe based on the values of the other should do the trick.

df1 <- read.table(text = "
      group     pos
    1  chr1 3202965
    2  chr1 3000168
    3  chr1 3000204
    4  chr2 3000560
    5  chr2 3000674
    6  chr3 3000698
    ",header = T)

df2 <- read.table(text = "
  group     pos
1  chr1 3180137
2  chr1 3200918
3  chr1 3202983
4  chr1 3309167
5  chr4 3458278
6  chr1 4249136
",header = T)

Transform would go through each row of df1. ifelse condition does create a new column, whose value is based on an ifelse condition, as required. The df1$group %in% df2$group and (df1$pos -df2$pos) > 500) conditions need to be satfied for it to set the out column value to "+".

transform(df1,out = ifelse((df1$group %in% df2$group) & ((df1$pos -df2$pos) > 500), "+","-"))

  group     pos out
1  chr1 3202965   +
2  chr1 3000168   -
3  chr1 3000204   -
4  chr2 3000560   -
5  chr2 3000674   -
6  chr3 3000698   -
JineshEP
  • 738
  • 4
  • 7