2

I have a df where value indicates the status of a drug:

g1 = data.frame ( 
    drug = c('a','a','a','d','d'),
    value = c('fda','trial','case','case','pre')
)

  drug value
1    a   fda
2    a trial
3    a  case
4    d  case
5    d   pre

For any drug that appears more than once, I want to replace every value with the highest-priority one, based on the following order of priority for value:

fda > trial > case > pre 

So, for example, if drug d is "case" as well as "pre", all occurrences of d will be reclassified as "case". The final table should look like this:

  drug value
1    a   fda
2    a   fda
3    a   fda
4    d  case
5    d  case

How can I do this without looping through each drug, working out the precedence first, and then replacing?

smci
Ahdee
  • Use `dplyr::mutate(value = case_when(...))`; this is a duplicate of lots of existing questions. – smci Mar 14 '19 at 00:55
  • Possible duplicate of [dplyr mutate with conditional values](https://stackoverflow.com/questions/22337394/dplyr-mutate-with-conditional-values) – smci Mar 14 '19 at 00:58
  • @smci - imho this isn't a dupe. Dealing with it as an ordinal variable seems more straightforward than writing several case_when statements. – thelatemail Mar 14 '19 at 01:18
  • @thelatemail: yes it most certainly is a dupe of lots of existing questions; there are [10,686 hits(!) on `dplyr mutate` alone](https://stackoverflow.com/search?q=%5Bdplyr%5D+mutate) and [469 hits on `dplyr mutate/ case_when` or `ifelse`](https://stackoverflow.com/search?q=%5Bdplyr%5D+mutate+case_when). I've been wading through them for a while, the only question is what's the best dupe target. Can you pick one? – smci Mar 14 '19 at 01:53
  • @smci - none of the 3 answers here have used case_when or ifelse, so I think that logic is not a good match for what is essentially just taking a minimum by group once the ordered variable is set. There could possibly be a match out there but I'm not having much luck finding something very good - https://stackoverflow.com/questions/39403317/how-do-you-find-a-maximum-character-of-a-vector-based-on-user-defined-hierarchy is a similar idea but not quite. – thelatemail Mar 14 '19 at 02:07
  • It would help if Ahdee states whether **each value of `fda, trial, case, pre` is guaranteed to exist, and in that order, for each `drug`**. That's not stated explicitly but answers seem to rely on it. (Yeah @thelatemail if that is an explicit constraint it changes the answers). What if a drug had an entry for `fda` but not `trial`, which should we then replace `case` with? Or can code assume that cannot happen? – smci Mar 14 '19 at 02:16
  • @smci - if it's an ordered factor as per my answer, I don't think it will matter. The minimum value will be the first in the order specified. See the edit to my answer. – thelatemail Mar 14 '19 at 02:37
  • @thelatemail: well the onus is on the OP to state the question constraints, clearly. Your answer is pretty neat, but if someone converts to string/unordered factor/ordinal then exports (e.g. as csv) and rereads, it could be brittle. What if a new or unseen drug-trial status is encountered? etc. – smci Mar 14 '19 at 02:41
  • @smci - if there's a new unseen drug-trial status, then I imagine any code, whether factor-based or case/when-based will have to change to account for it. – thelatemail Mar 14 '19 at 02:46
  • Hi, thanks everyone for your help with this. I did try to look through the previous answers but could not find one that solved my problem until now. To answer the questions above: the values are not guaranteed to exist; however, as stated by @smci, the ordered factor works regardless of missing values. – Ahdee Mar 14 '19 at 03:27
  • (There was [this question](https://stackoverflow.com/questions/44321321/summarise-an-ordered-factor-in-a-grouped-data-frame-with-dplyr), except using `summarise` on ordered factor instead of `mutate`, and it was more about an old bug in dplyr 0.5.0, and it has no answers, so don't use it). – smci Mar 14 '19 at 03:46
  • Ok I retracted my close-as-dupe vote: it is best handled with an ordered categorical and `min()`. The title seriously doesn't tell us what the question is about, in the code sense... – smci Mar 15 '19 at 23:22
  • @smci thanks for following up on this; the answers below have helped tremendously. Not sure how to edit the title though. Any suggestions? – Ahdee Mar 16 '19 at 17:28
  • @Ahdee: sure. if you want to edit the title to be more clear, please do; only the OP is supposed to do that once a question has answers. It depends how you want to state your problem; the current title uses really vague language; also your code implies both `drug` and `value` are strings, but later on answerers infer it would have been better to declare `value` as an ordered categorical, with the specific order-of-precedence you gave (so `min()` works as intended). Perhaps best to ask **"How best to represent dataframe column when I want to enforce order-of-precedence on its values?"** – smci Mar 17 '19 at 02:52

3 Answers

5

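Since this is an ordinal variable, you can make g1$value an ordered factor, which is the class that matches it. Then you can use functions like min and max as you would on a numeric. With the levels listed from highest to lowest priority, the highest-priority status in each group is its minimum: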

g1$value <- ordered(g1$value, levels = c("fda", "trial", "case", "pre"))
g1$value
#[1] fda   trial case  case  pre  
#Levels: fda < trial < case < pre
g1$value <- ave(g1$value, g1$drug, FUN=min)
g1
#  drug value
#1    a   fda
#2    a   fda
#3    a   fda
#4    d  case
#5    d  case

Or in dplyr speak:

library(dplyr)

g1 %>%
  mutate(value = ordered(value, levels = c("fda", "trial", "case", "pre"))) %>%
  group_by(drug) %>%
  mutate(value = min(value))

The order in the dataset and the range of values present in any drug group shouldn't affect this result:

g2 = data.frame ( 
    drug = c( "a","a","a","d","d","e","e","e"),
    value = c("fda","trial","case","case","pre","pre","fda","case")
)

#  drug value
#1    a   fda
#2    a trial
#3    a  case
#4    d  case
#5    d   pre
#6    e   pre
#7    e   fda
#8    e  case

g2 %>%
  mutate(value = ordered(value, levels = c("fda", "trial", "case", "pre"))) %>%
  group_by(drug) %>%
  mutate(value = min(value))

# A tibble: 8 x 2
# Groups:   drug [3]
#  drug  value
#  <fct> <ord>
#1 a     fda  
#2 a     fda  
#3 a     fda  
#4 d     case 
#5 d     case 
#6 e     fda  
#7 e     fda  
#8 e     fda 
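
On the point raised in the comments about a status that has never been seen before, here is a minimal base R sketch. It assumes a hypothetical data frame `g3` and a made-up status "withdrawn"; neither comes from the question. `ordered()` turns any value that is not listed in `levels` into NA, so one option is to collect those values first and then pass `na.rm = TRUE` to `min()` so the known statuses still decide the result:

```r
# Hypothetical data: "withdrawn" is a made-up status that is not in the priority list
g3 <- data.frame(
    drug  = c("a", "a", "d", "d"),
    value = c("trial", "withdrawn", "pre", "case"),
    stringsAsFactors = FALSE
)
priority <- c("fda", "trial", "case", "pre")   # highest priority first

# Values that are not in `levels` silently become NA
status <- ordered(g3$value, levels = priority)

# Collect the statuses that were dropped, so they can be reviewed
unseen <- unique(g3$value[is.na(status)])

# na.rm = TRUE skips the unknown status when taking the group minimum
g3$value <- ave(status, g3$drug, FUN = function(x) min(x, na.rm = TRUE))
```

Whether an unknown status should be ignored like this, or should stop the script, depends on the data; the sketch only shows where the NA comes from. It also assumes every drug has at least one known status.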
thelatemail
  • This is great and works well. I especially like the first example since it requires just base R and makes a lot of sense. Regarding the comments above about dupes: although there are a lot of similar questions, I could not find one that answered what I needed. – Ahdee Mar 14 '19 at 03:30
3

Update: using a map vector, which is what I used to do, since I do not want to change the column's type.

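library(dplyr)

# Rank each status; a higher number means a higher priority
mapvect = c(pre = 1, case = 2, trial = 3, fda = 4)
# as.character() so a factor column is looked up by name, not by integer code
g1$helpkey = mapvect[as.character(g1$value)]

# which.max() picks a single row even if the top status repeats within a drug
g1 %>% group_by(drug) %>% arrange(value) %>% dplyr::mutate(value = value[which.max(helpkey)])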
# A tibble: 5 x 3
# Groups:   drug [2]
  drug  value helpkey
  <chr> <chr>   <dbl>
1 a     fda         2
2 d     case        2
3 a     fda         4
4 d     case        1
5 a     fda         3
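
The same lookup idea can be sketched in base R without a helper column. This is only an illustration, not part of the answer above: the names `priority`, `pos` and `best` are made up here, and `stringsAsFactors = FALSE` is set explicitly so that `value` is a character column. `match()` gives each status its position in the priority vector, `ave()` takes the smallest position per drug, and indexing back into `priority` returns plain strings:

```r
g1 <- data.frame(
    drug  = c("a", "a", "a", "d", "d"),
    value = c("fda", "trial", "case", "case", "pre"),
    stringsAsFactors = FALSE
)
priority <- c("fda", "trial", "case", "pre")   # highest priority first

# Position of each status in the priority vector (1 = highest priority)
pos <- match(g1$value, priority)

# Smallest position within each drug, repeated for every row of that drug
best <- ave(pos, g1$drug, FUN = min)

# Translate the position back into a status; the column stays character
g1$value <- priority[best]
```

Row order is untouched and no extra column is left behind, so there is nothing to drop afterwards.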
BENY
  • This is not as good as @thelatemail's answer, it creates an unordered factor and relies on the desired order of values occurring in-order in the dataframe; if that doesn't happen it breaks. Better approach is ordered categorical with `min`. – smci Mar 15 '19 at 23:31
  • You don't need to create a helper column, can't you just directly operate on `min(value)`? Anyway if you do create it, you want to remove the helper column `%>% select(-helpkey)` – smci Mar 17 '19 at 12:41
3

Similar to @Wen-Ben's answer, with base functions you could also do:

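# Put the levels in priority order so that sorting places the best status first
g1$value <- factor(g1$value, levels = c("fda", "trial", "case", "pre"))
# Sort the rows by priority
g1 <- g1[order(g1$value),]
# match() returns the first row for each drug, which now holds its best status
g1$value <- g1[match(g1$drug, g1$drug), "value"]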
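
The sorting step above rearranges the rows whenever the data is not already in priority order. If the original row order matters, a possible variant is to sort an index vector instead of the data frame. This is a sketch under that assumption, using the g2 data from the first answer; `ord` and `top` are names introduced here for illustration:

```r
g2 <- data.frame(
    drug  = c("a", "a", "a", "d", "d", "e", "e", "e"),
    value = c("fda", "trial", "case", "case", "pre", "pre", "fda", "case"),
    stringsAsFactors = FALSE
)
g2$value <- factor(g2$value, levels = c("fda", "trial", "case", "pre"))

# Row indices sorted by priority; the data frame itself is left in place
ord <- order(g2$value)

# Index of the first (highest-priority) row for each drug
top <- ord[!duplicated(g2$drug[ord])]

# Find each row's drug among those top rows and copy that row's status
g2$value <- g2$value[top][match(g2$drug, g2$drug[top])]
```

Here drug e, whose rows are listed as pre, fda, case, ends up as fda on all three rows, and the rows stay in their original positions.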
MKa