0

I would like to subset a data.table by condition "Year". Basically I want the data from the dt that matches a given year, per group. However, some groups do not have a complete time line across all years, and therefore I would like to return the nearest year's data for every group, so there are data for every group present for any year chosen (whether that is exactly the right year, or not).

library(data.table)

# make dummy data
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)), 
x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))

# subset by, e.g., Year == 2015 is fine, but I want a full result for ANY
# year chosen, such as 2012, by using the closest entry in time, per group. 

# Attempt;

y <- 2012

dt[Year == which.min(abs(Year - y)), .SD, by = Group]

Empty data.table (0 rows and 3 cols): Group,x,Year

The result in this example should be;

   Group x Year
1:     A 4 2012
2:     B 7 2013
3:     C 2 2012
4:     D 3 2014
Sam
  • 1,400
  • 13
  • 29
  • How to break ties? E.g., what if you search for 2012 and there is 2011 and 2013 present. Which year should the be returned? – mnist Oct 28 '22 at 13:29
  • let's, for now, work on the premise that there are only years available in the future, i.e. timeline go back and stop, and do not restart – Sam Oct 28 '22 at 13:30
  • Hello, `which.min` returns an index. Do: `dt[, .SD[which.min(abs(Year - y))], by = Group]` – Ric Oct 28 '22 at 13:30
  • 1
    Two things: (1) don't use `require` this way: either use `library` or check the return-value from `require`, see https://stackoverflow.com/a/51263513/3358272, https://yihui.org/en/2014/07/library-vs-require/, https://r-pkgs.org/namespace.html#search-path. (2) Code blocks in the Stack interface use code fences `\`\`\`` as you used, but they must be on lines by themselves, shared with nothing. See my edits: the first code-fence `\`\`\`require(.)` _hid_ that expression, since Stack thought it to be a language-hint (no language named "require"), it is often `\`\`\`r` for R-based questions. – r2evans Oct 28 '22 at 13:33

1 Answers1

2

You are close: the use of which.min(abs(Year - y)) is good, but needs to be within the .SD-subsetting in the j portion.

dt[, .SD[which.min(abs(Year - y)),], Group]
#     Group     x  Year
#    <char> <int> <int>
# 1:      A     5  2012
# 2:      B     4  2013
# 3:      C     8  2012
# 4:      D     5  2014

Reproducible data

set.seed(42)
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)), x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
dt
#      Group     x  Year
#     <char> <int> <int>
#  1:      A     1  2011
#  2:      A     5  2012
#  3:      A     1  2013
#  4:      A     9  2014
#  5:      A    10  2015
#  6:      B     4  2013
#  7:      B     2  2014
#  8:      B    10  2015
#  9:      C     1  2011
# 10:      C     8  2012
# 11:      C     7  2013
# 12:      C     4  2014
# 13:      C     9  2015
# 14:      D     5  2014
# 15:      D     4  2015
y <- 2012
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    amazing! and thanks, yes so close. It's an index return. I shall accept when the timer allows me to. – Sam Oct 28 '22 at 13:32