I have two dataframes of different sizes:

df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))

df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

I would like to take each Start from df1 and check whether it falls within the Start-End range of a row of df2 that has the same Chr (the sample_id does not have to match). If it does, I want to add a column to df1, ideally containing df2$sample_id, or failing that "YES" (with NA for no match). It is similar to the question "Only checking range", but I also need to match 'Chr'.

It is also similar to the question "Check if column value is in between (range) of two other column values", and I know it should be easier, as I don't want to match respective rows.

I have tried:

df1 %>%
  mutate(no_coverage_in = case_when(df2$Start <= Start  & df2$End >=Start & Chr == df2$Chr ~ df2$sample_id ))

But it complains

longer object length is not a multiple of shorter object length

jay.sf
zw_nz
  • Yes, I didn't quite ask my question correctly, so I thought it better to delete and rephrase it properly than waste people's time – zw_nz Sep 09 '21 at 04:53
  • Are you open to *data.table* answers? `df1[df2, on=.(Chr, Start >= Start, Start <= End), hit := i.sample_id]` or something similar. – thelatemail Sep 09 '21 at 04:56
  • Gene should match too? – Daman deep Sep 09 '21 at 04:56
  • I get: unused argument (on = .(Chr, Start >= Start, Start <= End)). The gene will match if the other conditions are met. – zw_nz Sep 10 '21 at 05:10
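
A note on the data.table suggestion in the comments: "unused argument (on = ...)" is the error base R gives when `on` is passed to `[` on a plain data.frame, so it most likely means df1 had not been converted to a data.table first. A minimal sketch, assuming the data.table package is installed (dt1, dt2 and hit are names chosen here for illustration, not from the thread):

```r
library(data.table)

df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"),
                  stringsAsFactors = FALSE)

# Convert to data.table copies; a plain data.frame does not understand `on =`
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Non-equi update join: same Chr, and dt1's Start inside dt2's [Start, End].
# Rows of dt1 without a match keep NA in the new column `hit`.
dt1[dt2, on = .(Chr, Start >= Start, Start <= End), hit := i.sample_id]
dt1
```

Because the column is added by reference, no rows of dt1 are duplicated.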

4 Answers

Is this what you desire?

Given data frames
> df1
  Chr Start End  Gene sample_id
1   1    15  15 gene1       ss6
2   1   120 130 gene2       ss7
3   2   210 210 gene3       ss9
4   3   210 210 gene3       ss9
5   4   450 450 gene3      ss10
> df2
  Chr Start End  Gene sample_id
1   1    10  50 gene1       ss1
2   1   100 200 gene2       ss1
3   3   200 250 gene3       ss1

vec2 <- c()
for (k in 1:nrow(df1)) {
  if (df1$Chr[k] %in% df2$Chr)  {
    vec <- which(df2$Chr==df1$Chr[k])  
    for (m in 1:length(vec)) {
        if (df1$Start[k]<df2$Start[m] &df1$End[k]<df2$End[m]) {
          vec2[k] <- "Yes"
          
        }else{
          vec2[k] <- "No"
        }
    }
  }else{
    vec2[k] <- "No"
  }
}
df1$Results <- vec2

output

> df1
  Chr Start End  Gene sample_id Results
1   1    15  15 gene1       ss6     Yes
2   1   120 130 gene2       ss7      No
3   2   210 210 gene3       ss9      No
4   3   210 210 gene3       ss9      No
5   4   450 450 gene3      ss10      No
Daman deep
  • This nearly works...row 2 and row 4 should also be "Yes". I changed the second if statement to: if (df1$Start[k]>df2$Start[m] &df1$End[k]) but this only produces a "Yes" for row 2 – zw_nz Sep 10 '21 at 05:07
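
Following up on the comment above, the loop seems to go wrong in three places: the inner loop indexes df2 with the counter m rather than with the matching row numbers in vec, the comparison tests "less than" on both ends rather than "between", and a later iteration can overwrite an earlier "Yes". A possible corrected sketch, assuming inclusive bounds are wanted (res is a name made up here):

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"),
                  stringsAsFactors = FALSE)

# Start from "No" everywhere and flip to "Yes" on the first hit
res <- rep("No", nrow(df1))
for (k in seq_len(nrow(df1))) {
  # rows of df2 on the same chromosome as row k of df1
  vec <- which(df2$Chr == df1$Chr[k])
  for (m in vec) {
    # inclusive range check of df1's Start against df2's Start..End
    if (df1$Start[k] >= df2$Start[m] && df1$Start[k] <= df2$End[m]) {
      res[k] <- "Yes"
      break  # stop so a later non-matching row cannot overwrite the hit
    }
  }
}
df1$Results <- res
df1
```

With the example data this marks rows 1, 2 and 4 as "Yes", which is what the comment asks for.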

I believe this gives you your desired result:


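library(dplyr)

df1 %>%
  # suffix df2's Start, End and sample_id with "_2", then join on Chr and Gene
  left_join(df2 %>% rename_at(vars(Start, End, sample_id), paste0, "_2"),
            by = c("Chr", "Gene")) %>%
  mutate(sample_id_new = case_when(Start < End_2 & Start > Start_2 ~ sample_id_2)) %>%
  select(Chr, Start, End, Gene, sample_id, sample_id_new)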

Output:

  Chr Start End  Gene sample_id sample_id_new
1   1    15  15 gene1       ss6           ss1
2   1   120 130 gene2       ss7           ss1
3   2   210 210 gene3       ss9          <NA>
4   3   210 210 gene3       ss9           ss1
5   4   450 450 gene3      ss10          <NA>

  • I get: unexpected token '>' – zw_nz Sep 10 '21 at 03:44
  • Try this now. I originally had a native pipe operator `|>` which may have thrown an error if you are using an older version of R. Let me know if it doesn't work as I am intrigued. – Freddie J. Heather Sep 10 '21 at 04:46
  • This now works thank you. I don't want to duplicate the rows though - have a huge file as it is. Is there a way of not duplicating? – zw_nz Sep 10 '21 at 05:04
  • I have now modified the code, the replicates were due to not matching the `Gene` but i therefore assume you do also want to match `Gene`. I have therefore adapted the code. Let me know if this is the answer you were looking for. – Freddie J. Heather Sep 12 '21 at 06:50
  • @ Freddie J. Heather This works great - thank you – zw_nz Sep 12 '21 at 22:33

You could write a small FUNction that does the checks for each row of df1 and put it in an lapply that loops over its rows.

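FUN <- \(x, y) {
  # range check: df1's Start at or after y's Start, and df1's End before y's End
  rng <- df1[x, 2] >= y[, 2] & df1[x, 3] < y[, 3]
  # chromosome check
  chr <- df1[x, 1] == y[, 1]
  hit <- which(rng & chr)
  # sample_id of the first row of y passing both checks, otherwise NA
  if (length(hit)) y[hit[1], 5] else NA
}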

df1 <- transform(df1, match=unlist(lapply(seq.int(nrow(df1)), FUN, df2)))
df1
#   Chr Start End  Gene sample_id match
# 1   1    15  15 gene1       ss6   ss1
# 2   1   120 130 gene2       ss7   ss1
# 3   2   210 210 gene3       ss9  <NA>
# 4   3   210 210 gene3       ss9   ss1
# 5   4   450 450 gene3      ss10  <NA>

Note:

I used the shorthand notation for creating functions that was introduced in R 4.1.0. For older R versions, use FUN <- function(x, y) instead of FUN <- \(x, y), or update R.

jay.sf

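Here is a suggestion.

# TRUE for a row of df1 if any row of df2 has the same Chr and a
# Start..End range that contains df1's Start; FALSE otherwise
df1$match <- sapply(1:nrow(df1), function(x)
  any(df1[x, 'Chr'] == df2[, 'Chr'] &
      df1[x, 'Start'] <= df2[, 'End'] &
      df1[x, 'Start'] >= df2[, 'Start']))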
SBMVNO
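
The sapply approach above yields TRUE/FALSE. If the matching df2$sample_id is wanted instead, as the question prefers, one possible variation is sketched below. It assumes that the first matching row of df2 is good enough when several match (hit is a name introduced here for illustration):

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"),
                  stringsAsFactors = FALSE)

df1$match <- sapply(seq_len(nrow(df1)), function(x) {
  # rows of df2 with the same Chr whose range contains df1's Start
  hit <- which(df1$Chr[x] == df2$Chr &
               df1$Start[x] >= df2$Start &
               df1$Start[x] <= df2$End)
  # assumption: take the first match; NA when there is none
  if (length(hit) > 0) df2$sample_id[hit[1]] else NA_character_
})
df1
```

This keeps df1 at its original number of rows, since nothing is joined or duplicated.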