
Using R, I am trying to return a single row for each id: the most recent year in df that falls strictly before that id's target year in df_filter (i.e. the next lowest year based on the filter).

original df:

df:
id   year       
1    2019    
1    2018   
1    2005   
1    2004    
2    2018   
2    2017   
3    1998  
3    1997
3    1996
3    1995

the filter:

df_filter:
id   year       
1    2017  
2    2018
3    2000

the result data frame should look like this:

dfnew:
id   year
1    2005
2    2017
3    1998
Calvin Yu
  • Welcome to Stack! The thread [Extract row corresponding to minimum value of a variable by group](https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group) presents numerous different approaches and some benchmarks on many of them for comparison purposes. – Russ Thomas Jul 03 '19 at 02:35
  • Your new output is not clear about the logic – akrun Jul 03 '19 at 03:01
  • Apologies! The purpose is: "given a specific year for the id, what was the most recent previous year the id occurred?" – Calvin Yu Jul 03 '19 at 03:04
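
For concreteness, here is a minimal base R sketch of that logic, using the df and df_filter shown above: keep only each id's years that fall strictly before the target year, then take the per-id maximum (the `.target` suffix and helper objects are just for this sketch; the answers below do the same thing more idiomatically).

m <- merge(df, df_filter, by = "id", suffixes = c("", ".target"))
keep <- m[m$year < m$year.target, ]                   # rows strictly before the target year
dfnew <- aggregate(year ~ id, data = keep, FUN = max) # most recent remaining year per id
dfnew
#   id year
# 1  1 2005
# 2  2 2017
# 3  3 1998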

2 Answers


Using dplyr, we can left_join df and df_filter by id, arrange by the df year (year.x) in descending order, group_by id, and use slice to pick the first row whose year falls below the filter year (year.y); which.max on the logical vector returns the index of its first TRUE.

library(dplyr)

left_join(df, df_filter, by = "id") %>%   # year.x = year from df, year.y = target year from df_filter
   arrange(desc(year.x)) %>%
   group_by(id) %>%
   slice(which.max((year.x - year.y[1L]) < 0L)) %>%   # first row whose year falls below the target
   select(year.x)


#     id year.x
#   <int>  <int>
#1     1   2005
#2     2   2017
#3     3   1998
Ronak Shah
  • This is very close! I have updated df_filter for id "1" to be 2017, and the corresponding result in dfnew has been updated. Instead of returning the "closest row", I want the most recent year that is still before the filter year. I think the logic is slightly different, but only a small code change is needed... – Calvin Yu Jul 03 '19 at 02:59

We can use data.table

library(data.table)
setDT(df)[df_filter, on = .(id)][year != i.year, 
      .(year = year[which(year  < i.year)[1]]), id]
#   id year
#1:  1 2005
#2:  2 2017
#3:  3 1998

Or using non-equi join

setDT(df)[, year1 := year][df_filter, .(id, year), 
         on = .(id, year1 < year), mult = 'first']
#    id year
#1:  1 2005
#2:  2 2017
#3:  3 1998

Or without assigning (:=) in the original dataset

setDT(df)[, .(year1 = year, year, id)][df_filter, .(id, year),
        on = .(id, year1 < year), mult = 'first']

Or as @thelatemail commented

setDT(df)[df_filter, on=.(id, year < year), .(yearM = max(x.year)),
          by=.EACHI][, .(id, year = yearM)]

Or using tidyverse with fuzzyjoin

library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df, df_filter, by = c("id", "year"),
       match_fun = list(`==`, `<`)) %>% 
  group_by(id = id.x) %>%
  summarise(year = year.x[which(year.x < year.y)[1]])
# A tibble: 3 x 2
#     id year
#  <int> <int>
#1     1  2005
#2     2  2017
#3     3  1998

data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
    year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L, 
    1997L, 1996L, 1995L)), class = "data.frame", row.names = c(NA, 
-10L))
df_filter <- structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)), 
   class = "data.frame", row.names = c(NA, -3L))
akrun
  • @akrun, can you explain the advantage of using structure to create your df dataframe instead of just doing something like id <- c(1,1,1,1,2,2,3,3,3,3); year <- c(2019,2018,2005,2004,2018,2017,1998,1997,1996,1995) ; df <- data.frame(id,year)? Asking so I can learn, not questioning your use! – Russ Thomas Jul 03 '19 at 03:24
  • @RussThomas That is a little trick with the `dput` function. You create `df`, and `dput(df)` prints the exact structure you are using, which makes it easier to reproduce. On SO, users often post questions with images or only a loose rendering of the data, and code that we write and test may then not work on their original data because the column types differ (e.g. having `factor` instead of `numeric` when taking the `sum`). – akrun Jul 03 '19 at 03:25 (see the short `dput` sketch after these comments)
  • What about a rolling join? - `df_filter[df, on=c("id","year"), roll=-Inf, nomatch=0L][, .SD[1], by=id]` - no idea how it compares speed-wise but it took me a while to figure out so thought I'd share. – thelatemail Jul 03 '19 at 05:25
  • I've gone down the rabbit hole here a bit - also `df[df_filter, on=c("id","year – thelatemail Jul 03 '19 at 05:46
  • @thelatemail That is nice. I also tried it but couldn't get the `x.` part right – akrun Jul 03 '19 at 05:49
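
To illustrate the `dput` trick mentioned in the comments above (a minimal sketch, assuming the df_filter from the data section):

df_filter <- data.frame(id = 1:3, year = c(2017L, 2018L, 2000L))
dput(df_filter)   # prints a structure() call that recreates the object exactly, types included
# structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)), class = "data.frame", row.names = c(NA, -3L))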