Remove "duplicated" rows from data frame (they differ in few columns)

Question

So I have a similar problem to this question: Remove duplicate rows in R

In my case I would like to keep all of the columns (unlike the recommendation there to use the unique function on the first 3 columns). I want to take only 2 columns of the data frame into consideration and keep just 1 row whenever the values in both of those columns are the same.

The data looks like:

structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), row.names = c(NA, -20L), class = "data.frame")

The important columns for me are P1 and P2. I would like to keep only one of the rows that contain the same fruits/veggies (remember that the fruits/veggies have to be the same in both columns):

Example:

Before:

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
2   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
3  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
6  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
7  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge
8  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

After:

       P1       P2 P1_location_subacon            P1_location_all_predictors P2_location_subacon P2_location_all_predictors
1   Apple   Orange                <NA>       Table,Shelf,Cupboard,Bed,Fridge              Fridge         Table,Shelf,Fridge
4  Orange    Lemon              Fridge                    Table,Shelf,Fridge              Fridge           Shelf,Fridge,Bed
5  Tomato   Potato              Fridge                    Table,Shelf,Fridge                <NA>               Shelf,Fridge

It doesn't matter which of the duplicate rows is kept; it can be selected at random.

Same-ish question from yesterday: http://stackoverflow.com/questions/31148152/having-trouble-keeping-all-variables-after-removing-duplicates-from-a-dataset — Frank, Jul 01 '15 at 15:00

score 6 · Accepted Answer · answered Jul 01 '15 at 14:59

Just use duplicated() on the subset of columns that you want to be unique, and use the result to subset the main data.frame. For example:

dd[ !duplicated(dd[,c("P1","P2")]) , ]
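
As a minimal sketch of why this works, the following uses a small made-up data frame (the name `toy` and its `note` column are illustrative and not part of the question's data). `duplicated()` returns a logical vector that marks each row whose key columns repeat an earlier row, so negating it keeps the first row of every P1/P2 pair along with all of the other columns.

```r
# Hypothetical toy data: rows 1-2 and rows 3-4 share the same P1/P2 pair
toy <- data.frame(
  P1   = c("Apple", "Apple", "Orange", "Orange", "Tomato"),
  P2   = c("Orange", "Orange", "Lemon", "Lemon", "Potato"),
  note = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# TRUE marks a row whose (P1, P2) pair already appeared further up
is_dup <- duplicated(toy[, c("P1", "P2")])

# Keep only the first occurrence of each pair; every column is retained
deduped <- toy[!is_dup, ]
print(deduped)
```

The `note` column differs between the duplicated rows, yet it plays no part in the comparison because only P1 and P2 are passed to `duplicated()`.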

MrFlick

Or `unique` with the `by` option, i.e. `library(data.table);unique(dd, by = c('P1', 'P2'))` – akrun Jul 01 '15 at 15:04
@akrun does `dd` have to be a data.table for that to work, or does `data.table` modify the standard `unique()` function for data.frames as well? – MrFlick Jul 01 '15 at 15:06
It worked for the `dput` data. I am using the devel version of data.table – akrun Jul 01 '15 at 15:07
Yeah, I didn't expect it to work, but I guess it's general. – Frank Jul 01 '15 at 15:07
The `dplyr` analogue is `dd %>% distinct(P1, P2, .keep_all = TRUE)`, in case you want to lock down all the answers that are equivalent to this one. – Frank Jul 01 '15 at 15:08

score 3 · Answer 2 · TheComeOnMan · 2015-07-01T15:06:24.210

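If dt is your data frame:

library(data.table)
setDT(dt)  # convert dt to a data.table in place

# Flag is 0 on the first row of each (P1, P2) group and positive on the rest;
# := adds it to dt by reference, so dt itself keeps the Flag column
dtFiltered = dt[,
   Flag := .I - min(.I),
   list(P1,P2)
][
   Flag == 0
]
# drop the helper column from the filtered copy
dtFiltered = dtFiltered[,
  Flag := NULL
]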

Thanks to Frank for pointing out I missed the P2.

edited Jul 01 '15 at 15:06

answered Jul 01 '15 at 14:57

TheComeOnMan

Oops, yes. Thanks. Corrected – TheComeOnMan Jul 01 '15 at 15:06
Oh, now I see how it works. Instead of flagging, you could just extract the flagged rows, like `dt[ dt[,head(.I,1),by=.(P1,P2)]$V1 ]`. I'd just go with `unique` from data.table, though (mentioned by akrun above). – Frank Jul 01 '15 at 15:14
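
The two data.table alternatives mentioned in this comment can be sketched side by side. This assumes the data.table package is installed; `toy` and its `note` column are made up for illustration. The table is created as a data.table explicitly, since the `by` argument of `unique()` belongs to the data.table method rather than to base R's method for data frames.

```r
library(data.table)

# Hypothetical toy data, built directly as a data.table
toy <- data.table(
  P1   = c("Apple", "Apple", "Orange", "Orange", "Tomato"),
  P2   = c("Orange", "Orange", "Lemon", "Lemon", "Potato"),
  note = c("a", "b", "c", "d", "e")
)

# Option 1: unique() with `by` keeps the first row of each (P1, P2) group
via_unique <- unique(toy, by = c("P1", "P2"))

# Option 2: collect the first row index (.I) of each group, then subset by it
first_rows <- toy[, head(.I, 1), by = .(P1, P2)]$V1
via_index  <- toy[first_rows]

print(via_unique)
```

Both options leave `toy` unchanged and return all columns, so no helper column has to be added and removed afterwards.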

score 1 · Answer 3 · answered Jul 01 '15 at 15:06

Try this:

dat <- dat[!duplicated(dat[1:2]), ]
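
Note that `dat[1:2]` picks the key columns by position, which matches the question's data because P1 and P2 come first. As a small sketch on made-up data (`toy` and `note` are illustrative names), selecting the columns by name gives the same result without depending on column order, and `fromLast = TRUE` keeps the last row of each group instead of the first, which is acceptable here because the question does not care which row survives.

```r
# Hypothetical toy data with P1 and P2 as the first two columns
toy <- data.frame(
  P1   = c("Apple", "Apple", "Orange", "Orange", "Tomato"),
  P2   = c("Orange", "Orange", "Lemon", "Lemon", "Potato"),
  note = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Selecting the key columns by position or by name gives the same rows here
by_position <- toy[!duplicated(toy[1:2]), ]
by_name     <- toy[!duplicated(toy[c("P1", "P2")]), ]

# fromLast = TRUE scans from the bottom, so the last row of each pair is kept
keep_last <- toy[!duplicated(toy[c("P1", "P2")], fromLast = TRUE), ]
print(keep_last)
```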

Shiva
