Remove duplicated rows based on 2 columns in R

Question

I am new in R. I have this data frame:

TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:05    1      NA    Windows
12:00:13    1      NA    Windows
12:00:48    NA     1     Macintosh
12:01:30    NA     1     Macintosh
12:01:45    NA     1     Macintosh
12:02:01    2      NA    Windows
12:02:13    2      NA    Windows

And I want to remove duplicated rows based on columns IndexA and Value or IndexB and Value. So, in the end the data frame should be seen like this:

TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:48    NA     1     Macintosh
12:02:01    2      NA    Windows

How can I manage this ?

Please post the code you have tried so far.. Also, seems like possible duplicate of https://stackoverflow.com/questions/13742446/duplicates-in-multiple-columns — Tanu, Dec 16 '17 at 16:20
@Tanu I am sure there have been similar questions already been asked on SO but the Q you have linked to is not a duplicate, IMHO. In the linked Q *all* rows containing duplicate entries are to be removed whereas here the first entry is to be kept. — Uwe, Dec 17 '17 at 12:02

score 5 · Answer 1 · answered Dec 16 '17 at 15:51

A solution using dplyr. We can use the distinct function. -TimeStamp is to specify this column is excluded when considering duplication. .keep_all = TRUE means to keep all the columns.

library(dplyr)

dt2 <- dt %>%
  distinct(-TimeStamp, .keep_all = TRUE)
dt2
# # A tibble: 3 x 4
#   TimeStamp IndexA IndexB     Value
#       <chr>  <int>  <int>     <chr>
# 1  12:00:01      1     NA   Windows
# 2  12:00:48     NA      1 Macintosh
# 3  12:02:01      2     NA   Windows

DATA

dt <- read.table(text = "TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
                 12:00:05    1      NA    Windows
                 12:00:13    1      NA    Windows
                 12:00:48    NA     1     Macintosh
                 12:01:30    NA     1     Macintosh
                 12:01:45    NA     1     Macintosh
                 12:02:01    2      NA    Windows
                 12:02:13    2      NA    Windows",
                 header = TRUE, stringsAsFactors = FALSE)

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

For the sake of completeness, the unique() function from the data.table package can be used as well:

library(data.table)
unique(setDT(df), by = "IndexA")

   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

This is looking for unique values only in IndexA which is equivalent to Tito Sanz' answer. Obviously, this approach returns the expected result for the given sample data set but checking only one column for duplicate entries is oversimplifying IMHO and may fail with production data.

Or, looking for unique combinations of the values in three columns (which is equivalent to www's answer):

unique(setDT(df), by = 2:4) # very terse
unique(setDT(df), by = c("IndexA", "IndexB", "Value")) # explicitely named cols

   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

Data

library(data.table)
df <- fread(
  "TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:05    1      NA    Windows
12:00:13    1      NA    Windows
12:00:48    NA     1     Macintosh
12:01:30    NA     1     Macintosh
12:01:45    NA     1     Macintosh
12:02:01    2      NA    Windows
12:02:13    2      NA    Windows")

Tito Sanz · Answer 3 · 2017-12-16T16:02:17.947

0

You can use distinct function in tidyverse package.

foo %>% 
  distinct(IndexA, .keep_all = T)

Here is the result:

  TimeStamp IndexA IndexB     Value
1  12:00:01      1     NA   Windows
2  12:00:48     NA      1 Macintosh
3  12:02:01      2     NA   Windows

edited Dec 16 '17 at 16:02

answered Dec 16 '17 at 15:46

Tito Sanz

1,280
1
16
33

Remove duplicated rows based on 2 columns in R

3 Answers3

Data