Set a variable to NA if other variables are duplicates in R

Question

I have the following data frame containing medicine code according to the route of administration:

code <- data.frame(inn = c("ibuprofen", "ibuprofen", "ibuprofen", "fusidic acid", "fusidic acid"),
                   route = c("unknown", "unknown", "unknown", "oral", "topical"),
                   atc = c("R02AX02", "G02CC01", "M01AE01", "J01XC01", "D06AX01"))

           inn   route     atc
1    ibuprofen unknown R02AX02
2    ibuprofen unknown G02CC01
3    ibuprofen unknown M01AE01
4 fusidic acid    oral J01XC01
5 fusidic acid topical D06AX01

And another one containing patient treatment and event:

event <- data.frame(id = c(1, 1, 2),
                    inn = c("ibuprofen", "fusidic acid", "fusidic acid"),
                    route = c("unknown", "oral", "topical"),
                    event = c(TRUE, FALSE, TRUE))

  id          inn   route event
1  1    ibuprofen unknown  TRUE
2  1 fusidic acid    oral FALSE
3  2 fusidic acid topical  TRUE

I need to merge those data frames to get the following result:

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE NA

I do not get this result with a simple merge:

merge(x = event,
      y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE R02AX02
4    ibuprofen unknown  1  TRUE G02CC01
5    ibuprofen unknown  1  TRUE M01AE01

I thought of two solutions, but I did not manage to implement any:

modify the code data frame before the merge to set atc to NA if there is different atc for a group of inn and route (this seems more appropriate)
modify the result of merge to set atc to NA if there is different atc for a group of inn, route and id

How can I do it in base R? Is there another better way? I work in a restrictive environment where I only have access to base R.

score 2 · Answer 1 · answered Apr 06 '19 at 01:18

2

The code for the case 2:

code$inn_route <- paste0(code$inn,'_',code$route)
code$count <- table(code$inn_route)[code$inn_route]
code[code$count>1,3]<-NA
code$inn_route <- NULL
code$count <- NULL
code <- unique(code)
merge(event,code)


           inn   route id event   atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>

answered Apr 06 '19 at 01:18

Grzegorz Sionkowski

499
3
6

Thank you for your answer. Your reasoning gave me an idea for a more straightforward way (see my answer). – Thomas Apr 06 '19 at 11:23

score 1 · Accepted Answer · answered Apr 25 '19 at 19:56

Here's a straightforward way to accomplish option 2. Starting with the result of a simple merge:

mrg <- merge(x = event,
             y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE R02AX02
4    ibuprofen unknown  1  TRUE G02CC01
5    ibuprofen unknown  1  TRUE M01AE01

We then check which rows are duplicated (dropping the atc variable). We need to use duplicated twice because it actually finds duplicate rows, not rows that have duplicates. So, it would catch rows 4 and 5, but not 3 – to get that, we need to repeat duplicated from the opposite direction. Read more here: Finding ALL duplicate rows, including “elements with smaller subscripts”:

mrg$atc <- ifelse(duplicated(mrg[,-5]) | duplicated(mrg[,-5], fromLast = T),
                  NA,
                  mrg$atc)
mrg

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>
4    ibuprofen unknown  1  TRUE    <NA>
5    ibuprofen unknown  1  TRUE    <NA>

If you want to get rid of the duplicate rows 4 and 5, just run duplicated one more time to drop them:

mrg[!duplicated(mrg),]

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>

Thank you for your answer ! This method is much quicker than the `ave` method. I applied the `ifelse` on `code[c("inn", "route")]` before the `merge` to merge smaller data frames to be more efficient and I used `unique(x = code)` instead of `code[!duplicated(x = code), ]` because it is more efficient according to the doc. — Thomas, Apr 26 '19 at 08:23

Thomas · Answer 3 · 2019-04-25T15:36:27.373

Grzegorz Sionkowski's answer led me to the following solution:

code$atc <- as.character(x = code$atc)

code$atc <- ifelse(test = ave(x = code$atc,
                              code$inn,
                              code$route,
                              FUN = length) > 1,
                   yes = NA,
                   no = code$atc)

code <- unique(x = code)

merge(x = event,
      y = code)

           inn   route id event     atc
1 fusidic acid    oral  1 FALSE J01XC01
2 fusidic acid topical  2  TRUE D06AX01
3    ibuprofen unknown  1  TRUE    <NA>

However, as ave is quite slow on my real data, I am wondering if there is a faster base R method.

Set a variable to NA if other variables are duplicates in R

3 Answers3