I have a dataframe looking as:

    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    0         C
4   2020-04-01    1         C

I'd like to filter the duplicated rows, but only based on the columns Day and Element, and keep the rows where State is 1. My expected output is:

    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    1         C

I've tried drop_duplicates and distinct, but they're not working properly.

MustardRecord
2 Answers

Arrange the data in decreasing order of State (so 1 comes before 0) and use distinct, which keeps the first row of each Day/Element combination:

library(dplyr)
df %>%
  arrange(Element, desc(State)) %>%
  #If you have values greater than 1 in `State` use
  #arrange(Element, State != 1) %>%
  distinct(Day, Element, .keep_all = TRUE)

#         Day State Element
#1 2020-04-01     0       A
#2 2020-04-01     0       B
#4 2020-04-01     1       C

If a group can contain more than one State == 1 row and you want to keep all of them, try:

df %>%
  group_by(Day, Element) %>%
  slice(if(any(State == 1)) which(State == 1) else 1L)
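For completeness, the same keep-State-1 deduplication can be sketched in base R with order and duplicated (this block and its sample data are my addition, not part of the original answer):

```r
# Sample data matching the question (assumption: State is numeric).
df <- data.frame(
  Day     = rep("2020-04-01", 4),
  State   = c(0, 0, 0, 1),
  Element = c("A", "B", "C", "C")
)

# Sort so State = 1 rows come first within each Day/Element pair,
# then drop the later duplicates of that pair.
df2 <- df[order(df$Day, df$Element, -df$State), ]
df2 <- df2[!duplicated(df2[c("Day", "Element")]), ]
df2
#          Day State Element
# 1 2020-04-01     0       A
# 2 2020-04-01     0       B
# 4 2020-04-01     1       C
```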
Ronak Shah
data

df <- read.table(text = "    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    0         C
4   2020-04-01    1         C", header = TRUE)

tidyverse

library(tidyverse)
df %>% 
  group_by(Day, Element) %>% 
  arrange(State) %>% 
  slice_tail(n = 1) %>% 
  ungroup()
#> # A tibble: 3 x 3
#>   Day        State Element
#>   <chr>      <int> <chr>  
#> 1 2020-04-01     0 A      
#> 2 2020-04-01     0 B      
#> 3 2020-04-01     1 C

data.table

library(data.table)

setDT(df)
df[order(Element, Day, State)][, list(State = last(State)), by = list(Day, Element)]
#>           Day Element State
#> 1: 2020-04-01       A     0
#> 2: 2020-04-01       B     0
#> 3: 2020-04-01       C     1

Created on 2021-02-10 by the reprex package (v1.0.0)
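A shorter data.table variant (my sketch, not part of the original answer) reaches the same result by ordering on State and letting unique() keep the first row per Day/Element pair:

```r
library(data.table)

# Same sample data as above, rebuilt so the sketch is self-contained.
dt <- data.table(
  Day     = rep("2020-04-01", 4),
  State   = c(0L, 0L, 0L, 1L),
  Element = c("A", "B", "C", "C")
)

# order(-State) puts the State = 1 row first; unique() with `by` then
# keeps only the first row of each Day/Element combination.
unique(dt[order(-State)], by = c("Day", "Element"))
```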

Yuriy Saraykin