I have a dataframe looking as:

    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    0         C
4   2020-04-01    1         C

I'd like to filter the duplicated rows, but only based on the columns Day and Element, and keep the rows where State is 1. My expected output is:

    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    1         C

I've tried drop_duplicates and distinct, but they're not working properly.

MustardRecord
2 Answers

Arrange the data in decreasing order of State (so 1 comes before 0) and use distinct, which keeps the first row of each Day/Element combination:

library(dplyr)
df %>%
  arrange(Element, desc(State)) %>%
  #If you have values greater than 1 in `State` use
  #arrange(Element, State != 1) %>%
  distinct(Day, Element, .keep_all = TRUE)

#         Day State Element
#1 2020-04-01     0       A
#2 2020-04-01     0       B
#4 2020-04-01     1       C

If a group can contain more than one State == 1 row and you want to keep all of them, try:

df %>%
  group_by(Day, Element) %>%
  slice(if(any(State == 1)) which(State == 1) else 1L)
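For completeness, the same keep-State-1 deduplication can be sketched in base R with order and duplicated (this block and its sample data are my addition, not part of the original answer):

```r
# Sample data matching the question (assumption: State is numeric).
df <- data.frame(
  Day     = rep("2020-04-01", 4),
  State   = c(0, 0, 0, 1),
  Element = c("A", "B", "C", "C")
)

# Sort so State = 1 rows come first within each Day/Element pair,
# then drop the later duplicates of that pair.
df2 <- df[order(df$Day, df$Element, -df$State), ]
df2 <- df2[!duplicated(df2[c("Day", "Element")]), ]
df2
#          Day State Element
# 1 2020-04-01     0       A
# 2 2020-04-01     0       B
# 4 2020-04-01     1       C
```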
Ronak Shah
data

df <- read.table(text = "    Day           State     Element
1   2020-04-01    0         A
2   2020-04-01    0         B
3   2020-04-01    0         C
4   2020-04-01    1         C", header = TRUE)

tidyverse

library(tidyverse)
df %>% 
  group_by(Day, Element) %>% 
  arrange(State) %>% 
  slice_tail(n = 1) %>% 
  ungroup()
#> # A tibble: 3 x 3
#>   Day        State Element
#>   <chr>      <int> <chr>  
#> 1 2020-04-01     0 A      
#> 2 2020-04-01     0 B      
#> 3 2020-04-01     1 C

data.table

library(data.table)

setDT(df)
df[order(Element, Day, State)][, list(State = last(State)), by = list(Day, Element)]
#>           Day Element State
#> 1: 2020-04-01       A     0
#> 2: 2020-04-01       B     0
#> 3: 2020-04-01       C     1

Created on 2021-02-10 by the reprex package (v1.0.0)
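A shorter data.table variant (my sketch, not part of the original answer) reaches the same result by ordering on State and letting unique() keep the first row per Day/Element pair:

```r
library(data.table)

# Same sample data as above, rebuilt so the sketch is self-contained.
dt <- data.table(
  Day     = rep("2020-04-01", 4),
  State   = c(0L, 0L, 0L, 1L),
  Element = c("A", "B", "C", "C")
)

# order(-State) puts the State = 1 row first; unique() with `by` then
# keeps only the first row of each Day/Element combination.
unique(dt[order(-State)], by = c("Day", "Element"))
```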

Yuriy Saraykin