          date      county      state cases deaths  FIPS
 1: 2020-01-21   Snohomish Washington     1      0 53061
 2: 2020-01-22   Snohomish Washington     3      0 53061
 3: 2020-01-23   Snohomish Washington     5      1 53061
 4: 2020-01-24        Cook   Illinois     1      0 17031
 5: 2020-01-24   Snohomish Washington     5      1 53061
 6: 2020-01-25      Orange California     1      0  6059
 7: 2020-01-25 Los Angeles California     2      0  6037
 8: 2020-01-25   Snohomish Washington     5      2 53061
 9: 2020-01-26    Maricopa    Arizona     1      0  4013
10: 2020-01-26 Los Angeles California    17      0  6037
11: 2020-01-27    Maricopa    Arizona     3      1  4013
12: 2020-01-28        Cook   Illinois     2      2 17031

I would like to take the last (bottom-most) row for each county, i.e. the most recent data, since the data are ordered by date, and delete all of the older rows. Some counties' most recent data are in January, some are in March (not shown). My df1 is about 15,000 rows long. How can this be done? The output should be:

          date      county      state cases deaths  FIPS
 6: 2020-01-25      Orange California     1      0  6059
 8: 2020-01-25   Snohomish Washington     5      2 53061
10: 2020-01-26 Los Angeles California    17      0  6037
11: 2020-01-27    Maricopa    Arizona     3      1  4013
12: 2020-01-28        Cook   Illinois     2      2 17031

1 Answer


Here is one option with data.table. Grouped by 'county' and 'state', get the index of the row with the max 'date' and use it to subset the data.table

library(data.table)
setDT(df1)[, .SD[which.max(date)], .(county, state)]
#       county      state       date cases deaths  FIPS
#1:   Snohomish Washington 2020-01-25     5      2 53061
#2:        Cook   Illinois 2020-01-28     2      2 17031
#3:      Orange California 2020-01-25     1      0  6059
#4: Los Angeles California 2020-01-26    17      0  6037
#5:    Maricopa    Arizona 2020-01-27     3      1  4013
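
Note that with .SD the grouping columns come first in the result, so the column order differs from the input. If you want the original column order back (my addition, not part of the original answer), setcolorder can restore it

res <- setDT(df1)[, .SD[which.max(date)], .(county, state)]
setcolorder(res, names(df1))   # put the columns back in the input order
res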

Or a faster option with .I

setDT(df1)[df1[, .I[which.max(date)], .(county, state)]$V1]
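
To see what the .I step does (a sketch with the example data, not part of the original answer): the inner expression computes, within each 'county'/'state' group, the row number of the latest date, and subsetting df1 by that vector returns those rows directly instead of building .SD for every group

# the index vector alone; with this data it should be 8, 12, 6, 10, 11
setDT(df1)[, .I[which.max(date)], .(county, state)]$V1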

Or with slice

library(dplyr)
df1 %>%
    group_by(county, state) %>%
    slice(which.max(date))
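
If you are on dplyr 1.0.0 or later (an assumption about your setup; this variant is not part of the original answer), slice_max expresses the same idea a bit more directly

library(dplyr)
df1 %>%
    group_by(county, state) %>%
    slice_max(date, n = 1, with_ties = FALSE) %>%  # single latest row per group, even on ties
    ungroup()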

Or arrange by 'county', 'state', and 'date' and then get the distinct rows

df1 %>%
     arrange(county, state, desc(date)) %>% 
     distinct(state, county, .keep_all = TRUE)

Or with base R

subset(df1, ave(date, county, state, FUN = max) == date)
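
Another base R sketch (my addition, not from the original answer) is to compute the maximum date per group with aggregate and merge it back, which keeps only the rows matching each group's latest date (like the ave approach, it also keeps all tied rows if a group's maximum date appears more than once)

# one max date per county/state, then join on the shared columns (county, state, date)
latest <- aggregate(date ~ county + state, data = df1, FUN = max)
merge(df1, latest)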

Or another option is to order and then use duplicated

# unary '-' is not defined for Date class, so convert to numeric for the descending sort
df2 <- df1[with(df1, order(county, state, -as.numeric(date))),]
df2[!duplicated(df2[c('county', 'state')]),]
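
The same order-then-deduplicate idea can be written with data.table as well (my variant, assuming df1 has been converted with setDT): setorder accepts a minus sign for descending order, and unique keeps the first row per 'county'/'state' group

library(data.table)
setorder(setDT(df1), county, state, -date)    # latest date first within each group
unique(df1, by = c("county", "state"))        # keep that first (latest) row per group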

data

df1 <- structure(list(date = structure(c(18282, 18283, 18284, 18285, 
18285, 18286, 18286, 18286, 18287, 18287, 18288, 18289), class = "Date"), 
    county = c("Snohomish", "Snohomish", "Snohomish", "Cook", 
    "Snohomish", "Orange", "Los Angeles", "Snohomish", "Maricopa", 
    "Los Angeles", "Maricopa", "Cook"), state = c("Washington", 
    "Washington", "Washington", "Illinois", "Washington", "California", 
    "California", "Washington", "Arizona", "California", "Arizona", 
    "Illinois"), cases = c(1L, 3L, 5L, 1L, 5L, 1L, 2L, 5L, 1L, 
    17L, 3L, 2L), deaths = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 2L, 
    0L, 0L, 1L, 2L), FIPS = c(53061L, 53061L, 53061L, 17031L, 
    53061L, 6059L, 6037L, 53061L, 4013L, 6037L, 4013L, 17031L
    )), row.names = c("1:", "2:", "3:", "4:", "5:", "6:", "7:", 
"8:", "9:", "10:", "11:", "12:"), class = "data.frame")