removing rows of data based on multiple conditions

Question

Below is my current and desired data set. When date priority ID and revenue are the same but code is different I want to only keep the row with the 'highest' code.

The hierarchy for the codes is as follows: B>A>C. If there is any B, does not matter where in the string, it is assigned hierarchy 1.

large_df_have
   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3


large_df_want
   ID      Date Priority Revenue Code  V1  V2  V3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3

I understand `B>A>C` , but how does `AB` fit into that hierarchy? — thelatemail, May 17 '21 at 06:04
What happens if you have `'AB'` and `'B'` together? Which one do you give priority to? — Ronak Shah, May 17 '21 at 06:22

AnilGoyal · Accepted Answer · 2021-05-17T06:25:42.147

This will do

create one dummy col to create heirarchy among the codes as per given condition
then filter in only the highest priority row among these groups
remove dummy column (select(-..) if these are unwanted.

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups:   ID, Date, Priority, Revenue [1]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1

Check it on more complex case

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3
4 419 1/01/2020        1    -866    C XX3 XX1 XX3
5 420 1/01/2020        1    -866    A XX3 XX1 XX3
6 420 1/01/2020        1    -866    C XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups:   ID, Date, Priority, Revenue [3]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1
#> 2   419 1/01/2020        1    -866 C     XX3   XX1   XX3               3
#> 3   420 1/01/2020        1    -866 A     XX3   XX1   XX3               2

^{Created on 2021-05-17 by the reprex package (v2.0.0)}

score 0 · Answer 2 · answered May 17 '21 at 06:11

Depending on how many values of Code you have, you may create an additional numeric column reflecting the hierachy (incase the non-alphabetical order causes confusion). Check out ?ifelse if you aren't familiar, but the syntax is ifelse(test, yes, no) whereby if a test is TRUE the value returned is that specified by yes, otherwise it is no.

large_df_have %>%
   mutate(
      Code2 = ifelse(Code == 'B', 1, NA), # for the first time we make the Code2 column, 'no' values need to be NA 
      Code2 = ifelse(Code == 'A', 2, Code2), # rather than the 'no' result being NA as above, we keep any pre-existing values, eg the ones we just made in the line above
      Code2 = ifelse(Code == 'C', 3, Code2), 

      # and for your AB values (or others)
      Code2 = ifelse(Code == 'AB', 1.5, Code2)
      ) %>%
   
   # we create a group, of which we want the highest value of Code
   group_by(date, priority, ID, revenue) %>%

   # then we use filter() to keep the highest ranking rows for each group
   filter(
      Code2 == min(Code2)
   )

Without your data I can't test it out, but this approach should work.

score 0 · Answer 3 · answered May 17 '21 at 06:36

Without more context it's hard to provide code which does exactly as required.

Based on your question, two options jump to mind.

Option 1. You want to rank AB the same as B (for example).

Option 2. You want to rank AB differently to B (for example).

Option 1 is obviously problematic since the final row you use will be based on the order it appears in the original dataset. Option 2 might be better if the Code column represents an error. For example, if system with ID 418 has error code A and B, this is worse than just error code B.

library(dplyr)

df_have <- data.frame(ID   = c(418, 418, 418),
                      Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                      Priority = c(1, 1, 1), 
                      Revenue = c(-866, -866, -866), 
                      Code = c("A", "AB", "A"),
                      V1 = c("XX3", "XX2", "XX3"), 
                      V2 = c("XX1", "XX2", "XX1"),
                      V3 = c("XX3", "XX3", "XX3"))


# Option 1. Rank AB the same as B (for example)
df_want.1 <- df_have %>% 
  # add a numeric score based on the B > A > C ordering
  mutate(score = case_when(
    
    grepl("B", Code) ~ 3, 
    grepl("A", Code) ~ 2, 
    grepl("C", Code) ~ 1, 
    
  )) %>% 
  # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
  group_by(Date, Priority, ID, Revenue) %>% 
  # only keep the row for the group which has the highest score (or highest code)
  filter(score == max(score)) %>% 
  # AB and B will both produce a score of 3, so we only keep one of the rows in the group
  distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
  ungroup()
  
df_want.1

# Option 2. Rank AB above B (for example)
df_want.2 <- df_have %>% 
  # add a numeric score based on the B > A > C ordering
  mutate(score_b = if_else(grepl("B", Code), 3, 0), 
         score_a = if_else(grepl("A", Code), 2, 0), 
         score_c = if_else(grepl("C", Code), 1, 0)) %>% 
  # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
  group_by(Date, Priority, ID, Revenue) %>% 
  # add each of the scores together 
  mutate(row_score = score_b + score_a + score_c) %>% 
  # only keep the row for the group which has the highest score (or highest code combination)
  filter(row_score == max(row_score)) %>% 
  # assuming it's possible to have the same score across the group, only keep first row in the group
  distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
  ungroup()

df_want.2

score 0 · Answer 4 · answered May 17 '21 at 06:44

You can create a factor, and then use dplyr::distinct():

library(dplyr)

df_have <- data.frame(ID   = c(418, 418, 418),
                      Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                      Priority = c(1, 1, 1), 
                      Revenue = c(-866, -866, -866), 
                      Code = c("A", "AB", "A"),
                      V1 = c("XX3", "XX2", "XX3"), 
                      V2 = c("XX1", "XX2", "XX1"),
                      V3 = c("XX3", "XX3", "XX3"))

# Hierarchy of combinations
hierarchy <- c("B", "BA", "AB", "A", "C")

# Create factor
df_have %>%
  mutate(Code = factor(Code, levels = hierarchy)) %>%
  arrange(ID, Date, Priority, Revenue, Code) %>%
  distinct(ID, Date, Priority, Revenue, .keep_all = TRUE)
#>    ID      Date Priority Revenue Code  V1  V2  V3
#> 1 418 1/01/2020        1    -866   AB XX2 XX2 XX3

^{Created on 2021-05-17 by the reprex package (v2.0.0)}

score 0 · Answer 5 · answered May 17 '21 at 06:49

I think there are still some minor issues to clarify here, but self-plagiarising from this old question: How to reclassify/replace values based on priority when there are repeats , I think it makes sense to use an ordered factor:

library(data.table)

## set the order (small to large)
lev <- c("C","A","B")
setDT(have)

have[, ord := ordered(sapply(strsplit(Code, ""), 
                      function(x) max(ordered(x,levels=lev))), levels=lev)]
have[ have[, which.max(ord), by=.(ID, Date, Priority, Revenue)]$V1, ]

#    ID      Date Priority Revenue Code  V1  V2  V3 ord
#1: 418 1/01/2020        1    -866   AB XX2 XX2 XX3   B
#2: 418 1/01/2020        1    -866    A XX3 XX1 XX3   A

Using this extended data with two groups:

have <- read.table(text="
ID      Date Priority Revenue Code  V1  V2  V3
418 1/01/2020        1    -866    A XX3 XX1 XX3
418 1/01/2020        1    -866   AB XX2 XX2 XX3
418 1/01/2020        1    -866    A XX3 XX1 XX3
419 1/01/2020        1    -866    A XX3 XX1 XX3
419 1/01/2020        1    -866    A XX2 XX2 XX3
419 1/01/2020        1    -866    C XX3 XX1 XX3
", header=TRUE, stringsAsFactors=FALSE)

removing rows of data based on multiple conditions

5 Answers5