Remove rows from dataframe where var less than max(var) for each unique date

Question

Apologies if this has been asked but I couldn't find an existing solution.

Suppose a dataframe d as follows

+-------------------------------+
|  date        |  var1  | var2  | 
+-------------------------------+
|  2019/01/01  |  100   | abc   |
|  2019/01/01  |  102   | def   |
|  2019/01/02  |  99    | ghi   |
|  2019/01/02  |  98    | jkl   |
|  2019/01/03  |  100   | mno   |
|  2019/01/04  |  105   | pqr   |
|  2019/01/04  |  98    | stu   |
|  2019/01/04  |  110   | vwx   |
+-----------------------------+

With associated dput()

d <- structure(list(date = structure(c(17897, 17897, 17898, 17898, 
17899, 17900, 17900, 17900), class = "Date"), var1 = c(100, 102, 
99, 98, 100, 105, 98, 110), var2 = structure(1:8, .Label = c("abc", 
"def", "ghi", "jkl", "mno", "pqr", "stu", "vwx"), class = "factor")),      
class = "data.frame", row.names = c(NA, -8L))

I want to remove records from d based on three requirements:

Only one record shall remain for each unique date
The record selected per unique date is max(var1) across all records of the same date in d
I want to keep var2 (and any other columns within the real dataset)

Thus, the valid required output would be

+----------------------------------+
|  Date      |  var1   | var2      | 
+----------------------------------+
|  01/01/19  |  102    | def       |
|  02/01/19  |  99     | ghi       |
|  03/01/19  |  100    | mno       |
|  04/01/19  |  110    | vwx       |
+----------------------------------+

Thank you for any help. Please advise if question could be worded better to make it useful for others.

I think the wording is pretty clear and great that you shared a reproducible sample of your data. — s_baldur, Jan 11 '19 at 13:00
@akrun I am sure, you are aware but FYI I tagged this 18 mins ago https://stackoverflow.com/questions/54146920/how-to-replace-the-first-value-of-each-group/54146945#54146945 Dupe is a dupe irrespective of anybody who answers. :) I don't look at who has answered the question before marking it as duplicate. Let me know if I have marked any dupe incorrectly, I would be happy to correct it. — Ronak Shah, Jan 11 '19 at 13:16
@akrun I see the ~~double standard~~ inconsistency but from what I understood from the discussion this week was that duplicated question should always be marked as duplicate. So the issue/mistake is answering (which I did here without thinking about it). — s_baldur, Jan 11 '19 at 13:16
@snoram I am just saying :=). I have no problem with being tagged as dupe — akrun, Jan 11 '19 at 13:16

s_baldur · Accepted Answer · 2019-01-11T13:10:00.320

2

Using data.table:

library(data.table)
setDT(d)
d[, .SD[which.max(var1)], by = date]

         date var1 var2
1: 2019-01-01  102  def
2: 2019-01-02   99  ghi
3: 2019-01-03  100  mno
4: 2019-01-04  110  vwx

A very raw translation into base R gives:

do.call(
  rbind,
  lapply(
    split(d, d[["date"]]), 
    function(SD) SD[which.max(SD[["var1"]]), ]
  )   
)

edited Jan 11 '19 at 13:10

answered Jan 11 '19 at 12:59

s_baldur

29,441
4
36
69

This appears to work. Before I select as resolved, could you please clarify the portion of your code '.SD' and its function? Thank you – Jan 11 '19 at 13:01
basically `.SD` is another `data.table` with all the variables except, `date`. See more info here: https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r – s_baldur Jan 11 '19 at 13:02
Perfect, thanks for clarifying. Makes sense. – Jan 11 '19 at 13:04
1

For further illustration try `d[, nrow(.SD), by = date]`. – s_baldur Jan 11 '19 at 13:05

score 1 · Answer 2 · answered Jan 11 '19 at 13:04

1

With dplyr

library(dplyr)
d %>%
   group_by(date) %>%
   slice(which.max(var1))

answered Jan 11 '19 at 13:04

akrun

874,273
37
540
662

Remove rows from dataframe where var less than max(var) for each unique date

2 Answers2