Deleting duplicated rows based on condition (position)

Question

I have a dataset that looks something like this

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

I have mulple observations for each id and time identifier - e.g. I have 3 different alpha 1970 values. I would like to retain only one observation per id/year most notably the last one that appears in for each id/year. the final dataset should look something like this:

final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"), 
                    "Year" = c(1970,1971,1980,1981,1982), 
                    "Val" = c(-2,5,5,3,5))

Does anyone know how I can approach the problem?

Thanks a lot in advance for your help

[Select first and last row from grouped data](https://stackoverflow.com/questions/31528981/select-first-and-last-row-from-grouped-data), [How to select the first and last row within a grouping variable in a data frame?](https://stackoverflow.com/questions/8203818/how-to-select-the-first-and-last-row-within-a-grouping-variable-in-a-data-frame) — Henrik, Sep 29 '19 at 16:05

score 2 · Answer 1 · answered Sep 29 '19 at 15:39

If you are open to a data.table solution, this can be done quite concisely:

library(data.table)

setDT(df)[, .SD[.N], by = c("id", "Year")]
#>       id Year Val
#> 1: Alpha 1970  -2
#> 2: Alpha 1971   5
#> 3:  Beta 1980   5
#> 4:  Beta 1981   3
#> 5:  Beta 1982   5

by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.

DHW · Accepted Answer · 2019-09-29T16:24:12.987

1

How about this?

library(tidyverse)

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

final <- 
  df %>% 
  group_by(id, Year) %>% 
  slice(n()) %>% 
  ungroup()

final
#> # A tibble: 5 x 3
#>   id     Year   Val
#>   <fct> <dbl> <dbl>
#> 1 Alpha  1970    -2
#> 2 Alpha  1971     5
#> 3 Beta   1980     5
#> 4 Beta   1981     3
#> 5 Beta   1982     5

^{Created on 2019-09-29 by the reprex package (v0.3.0)}

Translates to "within each id-Year group, take only the row where the row number is equal to the size of the group, i.e. it's the last row under the current ordering."

You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct functions take the first distinct row, so you'd need to reverse the row ordering here first.

edited Sep 29 '19 at 16:24

answered Sep 29 '19 at 15:17

DHW

1,157
1
9
24

1

@Alessandro Happy to help. Please remember to accept the answer! :) – DHW Sep 29 '19 at 15:21
Sure, but the system blocks me if I accept the answer before 10 minutes. you have to be a little patient – Alex Sep 29 '19 at 15:24
@Alessandro Didn't know that! – DHW Sep 29 '19 at 15:25
Would you have some idea to do the opposite instead? i.e only selecting the first row? that would be extremely helpful as well! – Alex Sep 29 '19 at 16:20
Yep, that's just the `distinct()` solution I suggested, as-is. `distinct(id, Year, .keep_all = TRUE)`. Or slice by group with `n()` instead of `1`. – DHW Sep 29 '19 at 16:22

akrun · Answer 3 · 2019-09-29T16:21:22.720

1

An option with base R

aggregate(Val ~ ., df, tail, 1)
#     id Year Val
#1 Alpha 1970  -2
#2 Alpha 1971   5
#3  Beta 1980   5
#4  Beta 1981   3
#5  Beta 1982   5

If we need to select the first row

aggregate(Val ~ ., df, head, 1)

edited Sep 29 '19 at 16:21

answered Sep 29 '19 at 16:06

akrun

874,273
37
540
662

Would you have some idea to do the opposite instead? i.e only selecting the first row? that would be extremely helpful as well! – Alex Sep 29 '19 at 16:20
1

Thanks! that is great – Alex Sep 29 '19 at 17:09

Deleting duplicated rows based on condition (position)

3 Answers3