10

I have a data frame that looks like this (Dataframe X):

id  number  found
1   5225    NA
2   2222    NA
3   3121    NA

I have another data frame that looks like this (Dataframe Y):

id  number1  number2    
1   4000     6000
3   2500     3300
3   7000     8000

What I want to do is this: For each value in the Dataframe X "number" column, search if it is equal to or between ANY of the "number1" and "number2" pair values of Dataframe Y. Additionally, for this "number1" and "number2" pair values, its respective "id" must match the "id" in Dataframe X. If this is all true, then I want to insert a "YES in the "found" column of the respective row in Dataframe X:

id  number  found
1   5225    YES
2   2222    NA
3   3121    YES

How would I go about doing this? Thanks for the help.

Tony
  • 129
  • 1
  • 1
  • 8
  • 1
    Related work by David over here - https://stackoverflow.com/questions/41132081/find-which-interval-row-in-a-data-frame-that-each-element-of-a-vector-belongs-in/41132376 – thelatemail Oct 19 '18 at 04:06

4 Answers4

7

Using tidyverse functions, especially map_chr to iterate over each number:

library(tidyverse)
tbl1 <- read_table2(
"id   number  found
1    5225     NA
2    2222     NA
3    3121     NA"
)
tbl2 <- read_table2(
"id  number1  number2
1    4000   6000
2    2500   3300
3    7000   8000"
)

tbl1 %>%
  mutate(found = map_chr(
    .x = number,
    .f = ~ if_else(
      condition = any(.x > tbl2$number1 & .x < tbl2$number2),
      true = "YES",
      false = NA_character_
    )
  ))
#> # A tibble: 3 x 3
#>      id number found
#>   <int>  <int> <chr>
#> 1     1   5225 YES  
#> 2     2   2222 <NA> 
#> 3     3   3121 YES

Created on 2018-10-18 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42
  • @Tony - Calum is just recreating your data so they have something to work with. You can ignore the `read_table` bits at the start if you already have the data imported. – thelatemail Oct 19 '18 at 04:08
  • If I wanted to ensure that the id's from both tables also matched as well, how would I do so using your solution? – Tony Oct 21 '18 at 01:14
  • I see that you edited the question text but didn't change the supplied data? In what circumstance would they not align (are ids duplicated, missing, or just different between the two?) what should be returned in those cases? – Calum You Oct 21 '18 at 20:12
  • This is great! Is there a way to programmatically select `number1`, and `number2`? Also would it be possible to select a different `tbl2` for each row in `tbl1`, or if I were to add the potential thresholds to each row, can I use just `number1`, and `number2` directly. – Sam Oct 29 '21 at 15:29
7

Here is an option using fuzzy_join

library(fuzzy_join)
library(dplyr)
fuzzy_left_join(X, Y[-1], by = c("number" = "number1", "number" = "number2"), 
     match_fun  =list(`>=`, `<=`)) %>% 
    mutate(found = c(NA, "YES")[(!is.na(number1)) + 1]) %>% 
    select(names(X))
#    id number found
#1  1   5225   YES
#2  2   2222  <NA>
#3  3   3121   YES

Or another option is a non-equi join with data.table

library(data.table)
setDT(X)[, found := NULL]
X[Y, found := "YES", on = .(number >= number1, number <= number2)]
X
#   id number found
#1:  1   5225   YES
#2:  2   2222  <NA>
#3:  3   3121   YES

data

X <- structure(list(id = 1:3, number = c(5225L, 2222L, 3121L), found = c(NA, 
  NA, NA)), class = "data.frame", row.names = c(NA, -3L))

Y <- structure(list(id = 1:3, number1 = c(4000L, 2500L, 7000L), number2 = c(6000L, 
    3300L, 8000L)), class = "data.frame", row.names = c(NA, -3L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • If I wanted to ensure that the id's from both tables also matched as well, how would I do so using your solution? – Tony Oct 21 '18 at 01:14
6

We can loop over each x$number using sapply and check if it lies in range of any of y$number1 and y$number2 and give the value accordingly.

x$found <- ifelse(sapply(x$number, function(p) 
                 any(y$number1 <= p & y$number2 >= p)),"YES", NA)
x

#  id number found
#1  1   5225   YES
#2  2   2222  <NA>
#3  3   3121   YES

Using the same logic but with replace

x$found <- replace(x$found, 
         sapply(x$number, function(p) any(y$number1 <= p & y$number2 >= p)), "YES")

EDIT

If we want to also compare the id value we could do

x$found <- ifelse(sapply(seq_along(x$number), function(i) {
           inds <- y$number1 <= x$number[i] & y$number2 >= x$number[i]
           any(inds) & (x$id[i] == y$id[which.max(inds)])
           }), "YES", NA)

x$found
#[1] "YES" NA    "YES"
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • If I wanted to ensure that the id's from both tables also matched as well, how would I do so using your solution? – Tony Oct 21 '18 at 01:14
  • @Tony not quite sure if I understand you correctly. Could you update your question and explain what you are trying to achieve here? – Ronak Shah Oct 21 '18 at 06:20
  • I just updated the question. I want to ensure that the "id" value from each data frame also matches as well. So the "number" value in dataframe X will have to be between any "number1" and "number2" value of dataframe Y, and ALSO the "id" values in each dataframe must match as well for a "YES" to be added to the "found" column. – Tony Oct 21 '18 at 19:13
  • Hi @Ronak. Thanks for your answer, I understood it, except for the last line `any(inds) & (x$id[i] == y$id[which.max(inds)])`. What does this do? I understand that it is matching ids but how is it doing that by taking the max? – Manasi Shah Feb 19 '20 at 19:44
  • @ManasiShah `inds` is a vector of `TRUE`/`FALSE` values, using `which.max` we get the index of first `TRUE` value in it. Check `which.max(c(FALSE, TRUE, TRUE))` So we match `x`s id with the first `id` in `y`. – Ronak Shah Feb 19 '20 at 23:43
  • This is very helpful, thank you @RonakShah. But lets assume the value in the range doesn't have an id and we want to assign the id for the previous value from the y dataset, what function could help us to check that? maybe this question will help https://stackoverflow.com/questions/65098710/r-check-if-binary-values-are-in-between-range-of-a-column – Amal Nasir Dec 02 '20 at 04:14
4

Using sqldf:

library(sqldf)
sql <- "SELECT DISTINCT x.id, x.number, "
sql <- paste0(sql, "CASE WHEN y.id IS NOT NULL THEN 'YES' END AS found ")
sql <- paste0(sql, "FROM X x LEFT JOIN Y y ON x.number BETWEEN y.number1 AND y.number2")
X <- sqldf(sql)

enter image description here

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Note that in R character constants can run over mulitple lines so you don't have to paste the sql statement together from pieces. See, for example, (2) in this answer which is an unrelated question but illustrates an sql statement that runx over multiple lines: https://stackoverflow.com/questions/52797665/multiplying-year-based-vector-with-a-year-month-based-matrix-in-r/52798893#52798893 – G. Grothendieck Oct 19 '18 at 12:56
  • @G.Grothendieck I am coming from a Java/app language background. Good trick to know that a string can run over multiple lines. – Tim Biegeleisen Oct 19 '18 at 13:20