0

I am new in R. I have this data frame:

TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:05    1      NA    Windows
12:00:13    1      NA    Windows
12:00:48    NA     1     Macintosh
12:01:30    NA     1     Macintosh
12:01:45    NA     1     Macintosh
12:02:01    2      NA    Windows
12:02:13    2      NA    Windows

And I want to remove duplicated rows based on columns IndexA and Value or IndexB and Value. So, in the end the data frame should be seen like this:

TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:48    NA     1     Macintosh
12:02:01    2      NA    Windows

How can I manage this ?

Uwe
  • 41,420
  • 11
  • 90
  • 134
Volkan Demir
  • 13
  • 1
  • 4
  • Please post the code you have tried so far.. Also, seems like possible duplicate of https://stackoverflow.com/questions/13742446/duplicates-in-multiple-columns – Tanu Dec 16 '17 at 16:20
  • @Tanu I am sure there have been similar questions already been asked on SO but the Q you have linked to is not a duplicate, IMHO. In the linked Q *all* rows containing duplicate entries are to be removed whereas here the first entry is to be kept. – Uwe Dec 17 '17 at 12:02
  • Thanks @Uwe for pointing out. – Tanu Dec 17 '17 at 14:18

3 Answers3

5

A solution using dplyr. We can use the distinct function. -TimeStamp is to specify this column is excluded when considering duplication. .keep_all = TRUE means to keep all the columns.

library(dplyr)

dt2 <- dt %>%
  distinct(-TimeStamp, .keep_all = TRUE)
dt2
# # A tibble: 3 x 4
#   TimeStamp IndexA IndexB     Value
#       <chr>  <int>  <int>     <chr>
# 1  12:00:01      1     NA   Windows
# 2  12:00:48     NA      1 Macintosh
# 3  12:02:01      2     NA   Windows

DATA

dt <- read.table(text = "TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
                 12:00:05    1      NA    Windows
                 12:00:13    1      NA    Windows
                 12:00:48    NA     1     Macintosh
                 12:01:30    NA     1     Macintosh
                 12:01:45    NA     1     Macintosh
                 12:02:01    2      NA    Windows
                 12:02:13    2      NA    Windows",
                 header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
5

For the sake of completeness, the unique() function from the data.table package can be used as well:

library(data.table)
unique(setDT(df), by = "IndexA")
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

This is looking for unique values only in IndexA which is equivalent to Tito Sanz' answer. Obviously, this approach returns the expected result for the given sample data set but checking only one column for duplicate entries is oversimplifying IMHO and may fail with production data.

Or, looking for unique combinations of the values in three columns (which is equivalent to www's answer):

unique(setDT(df), by = 2:4) # very terse
unique(setDT(df), by = c("IndexA", "IndexB", "Value")) # explicitely named cols
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

Data

library(data.table)
df <- fread(
  "TimeStamp  IndexA IndexB Value
12:00:01    1      NA    Windows
12:00:05    1      NA    Windows
12:00:13    1      NA    Windows
12:00:48    NA     1     Macintosh
12:01:30    NA     1     Macintosh
12:01:45    NA     1     Macintosh
12:02:01    2      NA    Windows
12:02:13    2      NA    Windows")
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

You can use distinct function in tidyverse package.

foo %>% 
  distinct(IndexA, .keep_all = T)

Here is the result:

  TimeStamp IndexA IndexB     Value
1  12:00:01      1     NA   Windows
2  12:00:48     NA      1 Macintosh
3  12:02:01      2     NA   Windows
Tito Sanz
  • 1,280
  • 1
  • 16
  • 33