
I am trying to remove duplicate values row-wise. For example:

> head(data_Final)
  ID Value1 Value2 Value3 Value4
1  a    876    989    989    758
2  b    921    801    971    995
3  c    636    889   7724     95
4  d    999    999    896    999
5  e    251    254    251    235
6  f    552    100    669     15

I need the results to look like:

> head(data_Final)
  ID Value1 Value2 Value3 Value4
1  a    876    989     NA    758
2  b    921    801    971    995
3  c    636    889   7724     95
4  d    999     NA    896     NA
5  e    251    254     NA    235
6  f    552    100    669     15

I searched a lot, but the results I found were for duplicates in a column, not in a row.

  • You could pivot the data to long form, group by ID, mutate the duplicate values, and then pivot back to wide form (see the sketch below). – statstew May 22 '20 at 19:36
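
A minimal sketch of that pivoting approach, assuming the data_Final object from the question and the tidyr/dplyr verbs pivot_longer/pivot_wider (illustrative code, not posted in the thread itself):

library(dplyr)
library(tidyr)

data_Final %>%
    pivot_longer(-ID, names_to = "col", values_to = "val") %>%  # wide -> long
    group_by(ID) %>%                                            # one group per original row
    mutate(val = replace(val, duplicated(val), NA)) %>%         # mask repeats within the row
    ungroup() %>%
    pivot_wider(names_from = col, values_from = val)            # long -> wide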

3 Answers


We can use apply to loop over the rows of the numeric columns and replace the duplicated elements with NA:

data_Final[-1] <- t(apply(data_Final[-1], 1,
       function(x) replace(x, duplicated(x), NA)))
data_Final
#  ID Value1 Value2 Value3 Value4
#1  a    876    989     NA    758
#2  b    921    801    971    995
#3  c    636    889   7724     95
#4  d    999     NA    896     NA
#5  e    251    254     NA    235
#6  f    552    100    669     15

The apply can be changed to a for loop as well:

for(i in seq_len(nrow(data_Final))) {
   tmp <- data_Final[i, -1]
   data_Final[i, -1] <- replace(tmp, duplicated(tmp), NA)
}

Or using pmap from purrr:

library(dplyr)
library(purrr)
data_Final %>%
     select(-ID) %>%
     pmap_dfr(., ~ c(...) %>%
            replace(., duplicated(.), NA)) %>%
     bind_cols(select(data_Final, ID), .)
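
Here pmap_dfr treats each row of the selected columns as one set of arguments, c(...) collects them back into a named vector, the duplicated values are masked with NA, and the one-row results are row-bound; bind_cols then reattaches the ID column.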

Benchmarks

system.time(t(apply(data_Final[-1], 1,
       function(x) replace(x, duplicated(x), NA))))
#   user  system elapsed 
#  0.013   0.003   0.015 


system.time(for(i in seq_len(nrow(data_Final))) {
        tmp <- data_Final[i, -1]
        data_Final[i, -1] <- replace(tmp, duplicated(tmp), NA)
       }
 )
#   user  system elapsed 
#  0.014   0.004   0.018 

Regarding the discussion of for vs apply, it is already documented in multiple posts (here and here), and there is not much difference.

data

data_Final <- structure(list(ID = c("a", "b", "c", "d", "e", "f"),
    Value1 = c(876L, 921L, 636L, 999L, 251L, 552L),
    Value2 = c(989L, 801L, 889L, 999L, 254L, 100L),
    Value3 = c(989L, 971L, 7724L, 896L, 251L, 669L),
    Value4 = c(758L, 995L, 95L, 999L, 235L, 15L)),
    class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
  • Well, I haven't seen you use apply for quite a while now – Onyambu May 22 '20 at 20:05
  • @Onyambu with row-wise operations, it is easier to express the calculation directly instead of pivoting/repivoting etc. I like the elegance of the for loop, but it is just subjective – akrun May 22 '20 at 20:06
  • @akrun Yes, I also often find the for-loop more readable. And sometimes really "faster", because when using `apply` on data frames, one often has to use `t()` and `data.frame()` to get it done. But as I said, sometimes with `apply()` the computer became very slow. This appeared when the df was really huge. Maybe because one changes the values in situ ... – Gwang-Jin Kim May 22 '20 at 20:19
replace.dup <- function(x, val=NA) {
    x[duplicated(x)] <- val   # mask repeated values within the vector
    x
}


replace.row.wise.dups <- function(df, val=NA) {
    for (i in 1:nrow(df)) {
      # unlist() flattens the row so duplicated() compares across columns
      df[i, ] <- replace.dup(unlist(df[i, , drop=TRUE]), val)
    }
    df
}

In this case I would just use a for-loop; (l)apply plus t() can bend your mind quite easily (see the small illustration below), and for is not THAT slow. Sometimes apply functions are even very slow.
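
A minimal illustration (my own, not from the answer) of why t() keeps coming up: when the function returns a vector, apply(df, 1, f) stores each row's result as a column of the output matrix.

m <- apply(data.frame(a = 1:2, b = 3:4), 1, function(x) x * 10)
m      # 2 x 2 matrix: each original row is now a column
t(m)   # transpose to restore the row orientation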

recreate df

text <- "  ID   Value1    Value2  Value3   Value4
1 a      876      989      989      758
2 b      921      801      971      995
3 c      636      889      7724      95
4 d      999      999      896      999
5 e      251      254      251      235
6 f      552      100      669      015"

txt <- gsub(" +", "\t", text)

df <- read.delim(text = txt, sep="\t", row.names=1, stringsAsFactors=FALSE)

run it

replace.row.wise.dups(df, NA)

#   ID Value1 Value2 Value3 Value4
# 1  a    876    989   <NA>    758
# 2  b    921    801    971    995
# 3  c    636    889   7724     95
# 4  d    999   <NA>    896   <NA>
# 5  e    251    254   <NA>    235
# 6  f    552    100    669     15

for-loop slightly slower than apply

Sorry, I was wrong. I thought the for-loop is faster. But in case your df is very large, the for-loop might give you more speed, probably because of memory issues, as @akrun pointed out.

# from @akrun
replace.row.wise.dups.1 <- function(df, val=NA) {
  as.data.frame(t(apply(df, 1, function(x) replace(x, duplicated(x), NA))))
}

require(microbenchmark)
mbm <- microbenchmark("apply" = replace.row.wise.dups.1(df, NA),
                      "for-loop" = replace.row.wise.dups(df, NA),
                      times = 1000)
# > mbm
# Unit: microseconds
#      expr     min       lq     mean   median      uq      max neval
#     apply 600.950 623.9905 673.0897 632.4485 645.910 3668.063  1000
#  for-loop 696.792 727.8785 791.7684 754.1875 772.129 2491.147  1000

  • @akrun yes, that I had from you (I had assignments instead). But it is about the for-loop in this case. `apply` functions for such stuff on data frames can become very slow. – Gwang-Jin Kim May 22 '20 at 19:23
  • @akrun - I told you, my answer is not about `replace()`. It is about applying a `for`-loop. In R, there is a myth that `apply` functions are faster and more efficient than the `for`-loop. But it is simply not true. – Gwang-Jin Kim May 22 '20 at 19:46
  • @akrun I will change my function because you are making the point with `replace()`. – Gwang-Jin Kim May 22 '20 at 19:53
  • @akrun not true, my function doesn't change the `df` outside in place. – Gwang-Jin Kim May 22 '20 at 19:55
  • @akrun I don't care. My point was just the `for`-loop vs. `apply()` and all its consequences, `t()` and `data.frame()` ... – Gwang-Jin Kim May 22 '20 at 20:00
  • @akrun you are right, I made a mistake in the benchmark. Took the wrong function. – Gwang-Jin Kim May 22 '20 at 20:04
  • You could compile the `for` loop and make it a bit faster (see the sketch after these comments), but in any case it is pointless to do a comparison with apply, and that was my take. The reason I commented strongly is to give correct info and nothing else. Thanks – akrun May 22 '20 at 20:08
  • @akrun Thanks! I thought the `for`-loop in this case is faster. Sometimes one has to do `t()` with `data.frame()` because of apply. I just wanted to prove it to myself out of curiosity. Sorry, didn't want to annoy you. – Gwang-Jin Kim May 22 '20 at 20:11
  • I find the `t` to be very fast though, I could be wrong. The only issue may be the memory for large datasets – akrun May 22 '20 at 20:12
  • @akrun Ah, that might be true, that it is the memory footprint ... sometimes apply-functions were taking more than 30 min to process a huge data frame. That is true ... could be because of memory ... (I am in bioinformatics, and we sometimes have tables several hundred MB or even several GB in size) – Gwang-Jin Kim May 22 '20 at 20:14
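
A minimal sketch of the byte-compilation idea from the comment above, using base R's compiler package (the name replace.row.wise.dups.c is illustrative; since R 3.4 functions are JIT-compiled by default, so the gain is usually small):

library(compiler)
# byte-compile the for-loop based function defined above
replace.row.wise.dups.c <- cmpfun(replace.row.wise.dups)
replace.row.wise.dups.c(df, NA)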

You could use ave:

data_Final[-1] <- ave(unlist(data_Final[-1]), row(data_Final[-1]),
                      FUN = function(x) `is.na<-`(x, duplicated(x)))
data_Final
  ID Value1 Value2 Value3 Value4
1  a    876    989     NA    758
2  b    921    801    971    995
3  c    636    889   7724     95
4  d    999     NA    896     NA
5  e    251    254     NA    235
6  f    552    100    669     15
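
Here row(data_Final[-1]) supplies each value's row index as the grouping variable, so ave applies the function within each row of the unlisted values; `is.na<-`(x, duplicated(x)) is the replacement-function form of is.na(x) <- duplicated(x), which sets the repeated positions to NA.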