0

I would like to use a category from one data frame and apply it to another based on a similar column (merge). But, the merge needs to consider a range of data points that are found between two columns. I have an example below.

set.seed(123)  

df_1 <- tibble(
  x = c(0, 500, 1000, 1500, 2000),
  y = c(499, 999, 1499, 1999, 99999),
  desc = LETTERS[1:5]
)

    > df_1
# A tibble: 5 x 3
      x     y desc 
  <dbl> <dbl> <chr>
1     0   499 A    
2   500   999 B    
3  1000  1499 C    
4  1500  1999 D    
5  2000 99999 E 

df_2 <- tibble(
  code = sample(1:2500,5,F)
)

 >df_2
# A tibble: 5 x 1
       code
      <int>
    1   719
    2  1970
    3  1022
    4  2205
    5  2348

## desired output


df_2 %>% 
  mutate(desc = c('B', 'D', 'C', 'E', 'E'))
# A tibble: 5 x 2
   code desc 
  <int> <chr>
1   719 B    
2  1970 D    
3  1022 C    
4  2205 E    
5  2348 E  

My first thought was to split df_1 and merge somehow, but I'm stuck on how to deal with the range of values found in x and y. Any ideas?

elliot
  • 1,844
  • 16
  • 45
  • 1
    A non equi join or a rolling join would do this, but they don't exist in dplyr yet. Maybe useful: https://stackoverflow.com/q/37289405 or https://github.com/tidyverse/dplyr/issues/2240 – Frank Oct 09 '18 at 15:15

3 Answers3

2

This is an easy problem to handle in SQL, so one option would be to use the sqldf package, with this query:

SELECT t2.code, COALESCE(t1.desc, '') AS desc
FROM df_2 t2
LEFT JOIN df_1 t1
    ON t2.code BETWEEN t1.x AND t1.y;

R code:

library(sqldf)
sql <- paste0("SELECT t2.code, COALESCE(t1.desc, '') AS desc ",
              "FROM df_2 t2 LEFT JOIN df_1 t1 ON t2.code BETWEEN t1.x AND t1.y")
result <- sqldf(sql)
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
1
library(tidyverse)

set.seed(123)  

df_1 <- tibble(
  x = c(0, 500, 1000, 1500, 2000),
  y = c(499, 999, 1499, 1999, 99999),
  desc = LETTERS[1:5]
)

df_2 <- tibble(
  code = sample(1:2500,5,F)
)


df_1 %>%
  mutate(code = map2(x, y, ~seq(.x, .y, 1))) %>%  # create a sequence of numbers with step = 1
  unnest() %>%                                    # unnest data
  inner_join(df_2, by="code") %>%                 # join df_2 
  select(-x, -y)                                  # remove columns

# # A tibble: 5 x 2
#   desc   code
#   <chr> <dbl>
# 1 B       719
# 2 C      1022
# 3 D      1970
# 4 E      2205
# 5 E      2348
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
0

This seems to work, but is not very tidyverse-ish:

df_2 %>% mutate(v = with(df_1, desc[ findInterval(code, x) ]))

  code v
1  719 B
2 1970 D
3 1022 C
4 2205 E
5 2348 E

This only uses the x column, so the assumption is that there are no gaps in the ranges (y is always one below the next x).

Frank
  • 66,179
  • 8
  • 96
  • 180