
Using R, I am trying to return a single row for each id: the most recent year in df that falls strictly before that id's target year in df_filter (i.e. the next lowest year based on the filter).

original df:

df:
id   year       
1    2019    
1    2018   
1    2005   
1    2004    
2    2018   
2    2017   
3    1998  
3    1997
3    1996
3    1995

the filter:

df_filter:
id   year       
1    2017  
2    2018
3    2000

the result data frame should look like this:

dfnew:
id   year
1    2005
2    2017
3    1998
Calvin Yu
  • Welcome to Stack! The thread [Extract row corresponding to minimum value of a variable by group](https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group) presents numerous different approaches and some benchmarks on many of them for comparison purposes. – Russ Thomas Jul 03 '19 at 02:35
  • Your new output is not clear about the logic – akrun Jul 03 '19 at 03:01
  • Apologies! The purpose is: "given a specific year for the id, what was the most recent previous year the id occurred?" – Calvin Yu Jul 03 '19 at 03:04
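
For concreteness, here is a minimal base R sketch of that logic, using the df and df_filter shown above: keep only each id's years that fall strictly before the target year, then take the per-id maximum (the `.target` suffix and helper objects are just for this sketch; the answers below do the same thing more idiomatically).

m <- merge(df, df_filter, by = "id", suffixes = c("", ".target"))
keep <- m[m$year < m$year.target, ]                   # rows strictly before the target year
dfnew <- aggregate(year ~ id, data = keep, FUN = max) # most recent remaining year per id
dfnew
#   id year
# 1  1 2005
# 2  2 2017
# 3  3 1998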

2 Answers


Using dplyr, we can left_join df and df_filter by id, arrange by the df year (year.x) in descending order, group_by id, and use slice to pick the first row whose year falls below the filter year (year.y); which.max on the logical vector returns the index of its first TRUE.

library(dplyr)

left_join(df, df_filter, by = "id") %>%   # year.x = year from df, year.y = target year from df_filter
   arrange(desc(year.x)) %>%
   group_by(id) %>%
   slice(which.max((year.x - year.y[1L]) < 0L)) %>%   # first row whose year falls below the target
   select(year.x)


#     id year.x
#   <int>  <int>
#1     1   2005
#2     2   2017
#3     3   1998
Ronak Shah
  • This is very close! I have updated df_filter for id "1" to be 2017, and the corresponding result in dfnew has been updated. Instead of returning the "closest row", I want the most recent year that is still before the filter year. I think the logic is slightly different, but only a small code change is needed... – Calvin Yu Jul 03 '19 at 02:59

We can use data.table

library(data.table)
setDT(df)[df_filter, on = .(id)][year != i.year, 
      .(year = year[which(year  < i.year)[1]]), id]
#   id year
#1:  1 2005
#2:  2 2017
#3:  3 1998

Or using non-equi join

setDT(df)[, year1 := year][df_filter, .(id, year), 
         on = .(id, year1 < year), mult = 'first']
#    id year
#1:  1 2005
#2:  2 2017
#3:  3 1998

Or without assigning (:=) in the original dataset

setDT(df)[, .(year1 = year, year, id)][df_filter, .(id, year),
        on = .(id, year1 < year), mult = 'first']

Or as @thelatemail commented

setDT(df)[df_filter, on=.(id, year < year), .(yearM = max(x.year)),
          by=.EACHI][, .(id, year = yearM)]

Or using tidyverse with fuzzyjoin

library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df, df_filter, by = c("id", "year"),
       match_fun = list(`==`, `<`)) %>% 
  group_by(id = id.x) %>%
  summarise(year = year.x[which(year.x < year.y)[1]])
# A tibble: 3 x 2
#     id year
#  <int> <int>
#1     1  2005
#2     2  2017
#3     3  1998

data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
    year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L, 
    1997L, 1996L, 1995L)), class = "data.frame", row.names = c(NA, 
-10L))
df_filter <- structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)), 
   class = "data.frame", row.names = c(NA, -3L))
akrun
  • @akrun, can you explain the advantage of using structure to create your df dataframe instead of just doing something like id <- c(1,1,1,1,2,2,3,3,3,3); year <- c(2019,2018,2005,2004,2018,2017,1998,1997,1996,1995) ; df <- data.frame(id,year)? Asking so I can learn, not questioning your use! – Russ Thomas Jul 03 '19 at 03:24
  • @RussThomas That is a little trick with the `dput` function. You create `df`, and `dput(df)` prints the exact structure you are using, which makes it easier to reproduce. On SO, users often post questions with images or only a loose rendering of the data, and code that we write and test may then not work on their original data because the column types differ (e.g. having `factor` instead of `numeric` when taking the `sum`). – akrun Jul 03 '19 at 03:25 (see the short `dput` sketch after these comments)
  • What about a rolling join? - `df_filter[df, on=c("id","year"), roll=-Inf, nomatch=0L][, .SD[1], by=id]` - no idea how it compares speed-wise but it took me a while to figure out so thought I'd share. – thelatemail Jul 03 '19 at 05:25
  • I've gone down the rabbit hole here a bit - also `df[df_filter, on=c("id","year – thelatemail Jul 03 '19 at 05:46
  • @thelatemail That is nice. I also tried it but couldn't get the `x.` part right – akrun Jul 03 '19 at 05:49
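
To illustrate the `dput` trick mentioned in the comments above (a minimal sketch, assuming the df_filter from the data section):

df_filter <- data.frame(id = 1:3, year = c(2017L, 2018L, 2000L))
dput(df_filter)   # prints a structure() call that recreates the object exactly, types included
# structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)), class = "data.frame", row.names = c(NA, -3L))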