Having trouble keeping all variables after removing duplicates from a dataset

Question

So, I imported a dataset with 178 observations and 8 variables. The end goal was to eliminate all observations that were the same across three of those variables (2, 5, and 6). This proved quite easy using the unique command.

mav2 <- unique(mav[,c(2,5,6)])

The resulting mav2 dataframe had 55 observations, getting rid of all the duplicates! Unfortunately, it also got rid of the other five variables that I did not use in the unique command (1, 3, 4, 7, and 8). I initially tried adding the two dataframes, but of course this did not work since they were of unequal size. I have also tried merging the two, but this fails and just gives an output of the first dataset with all 178 observations.
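For illustration, here is a minimal sketch of why the merge hands back every original row. The data frame `toy`, its column names, and its values are made up for this example and are not the real `mav` data.

```r
# Toy stand-in for mav (hypothetical data): rows 1 and 3 share the same a/b pair.
toy <- data.frame(id = 1:4,
                  a  = c("x", "y", "x", "z"),
                  b  = c(1, 2, 1, 3))

# unique() on the column subset drops the duplicate, but also the id column.
toy2 <- unique(toy[, c("a", "b")])

# merge() joins on the shared columns a and b. Every row of toy has a
# matching a/b pair in toy2, so every row of toy comes back.
merged <- merge(toy, toy2)

nrow(toy2)    # 3
nrow(merged)  # 4
```

Because the deduplicated table still contains every combination that occurs in the original, joining on those columns cannot filter anything out.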

The second dataset (mav2) did produce a new column (row.names) which is the row number for each observation from the initial dataset.

If anyone could help me out on getting all 8 initial variables into a dataset with only the 55 unique observations, I would be very appreciative. Thanks in advance.

If you use a `data.table`, the `unique` function for that has a `by` argument. — Frank, Jun 30 '15 at 20:44
Could you provide a sample `mav` dataset? This makes your question more reproducible: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Paulo MiraMor, Jun 30 '15 at 20:46

score 4 · Accepted Answer · answered Jun 30 '15 at 20:49

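I think what you want is `duplicated`, a function similar to `unique` that returns a logical vector marking which elements (here, rows) are duplicates of an earlier one. Negating it keeps the first occurrence of each combination, along with every column.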

So

mav2 <- mav[!duplicated(mav[,c(2,5,6)]),]

EDIT: inverted sense of duplicated
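A small worked example may make this concrete. The data frame `mav_demo` and its values are hypothetical stand-ins for the real 178-row dataset; only the column positions (2, 5, 6) follow the question.

```r
# Hypothetical 8-column stand-in for mav. Rows 1 and 4 agree on columns
# 2, 5 and 6, and so do rows 2 and 5.
mav_demo <- data.frame(
  v1 = 1:5,
  v2 = c("a", "b", "c", "a", "b"),
  v3 = 11:15,
  v4 = 21:25,
  v5 = c(1, 2, 3, 1, 2),
  v6 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  v7 = 31:35,
  v8 = 41:45
)

# TRUE for each row whose (v2, v5, v6) combination already appeared earlier
dup <- duplicated(mav_demo[, c(2, 5, 6)])
dup
# FALSE FALSE FALSE  TRUE  TRUE

# Keep the first occurrence of each combination, with all 8 columns intact
mav_demo2 <- mav_demo[!dup, ]
dim(mav_demo2)
# 3 8
```

The subset is only used to decide which rows to keep; the row filter is then applied to the full data frame, which is why no columns are lost.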

user295691


Thank you very much, that gave me exactly what I wanted! – Jun 30 '15 at 21:32
@Brent in that case, would you mind accepting the answer? – user295691 Jul 01 '15 at 17:38

teucer · Answer 2 · 2015-06-30T21:34:00.023

1

You can try this

mav$key <- 1:nrow(mav)
mav2 <- unique(mav[,c(2,5,6)])
mav_unique <- mav[mav$key%in%mav2$key,]
mav_unique$key <- NULL

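EDIT: to address the key issue. The version above does not work as written: `mav2` contains only columns 2, 5, and 6, so `mav2$key` is `NULL` and no rows are selected. Matching on row names instead avoids this: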

rownames(mav) <- 1:nrow(mav)  # to make sure they are correctly set
mav2 <- unique(mav[, c(2, 5, 6)])
mav_unique <- mav[rownames(mav) %in% rownames(mav2), ]
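As a rough check, the row-name approach can be compared against `duplicated` on a toy data frame. `mav_demo`, `sub`, `by_rownames`, and `by_duplicated` are hypothetical names and the values are invented for this sketch.

```r
# Hypothetical toy data; rows 1 and 3 agree on columns 2, 5 and 6.
mav_demo <- data.frame(v1 = 1:4, v2 = c(5, 6, 5, 7), v3 = 4:1, v4 = 0,
                       v5 = c(1, 2, 1, 3), v6 = c(9, 9, 9, 9),
                       v7 = letters[1:4], v8 = 8)

# Reset row names so they match the row positions
rownames(mav_demo) <- seq_len(nrow(mav_demo))

# unique() keeps the row names of the rows that survive
sub <- unique(mav_demo[, c(2, 5, 6)])
rownames(sub)
# "1" "2" "4"

# Select rows by surviving row name, and by !duplicated() for comparison
by_rownames   <- mav_demo[rownames(mav_demo) %in% rownames(sub), ]
by_duplicated <- mav_demo[!duplicated(mav_demo[, c(2, 5, 6)]), ]
identical(by_rownames, by_duplicated)
# TRUE
```

Both routes keep the first occurrence of each combination; the `duplicated` version simply skips the intermediate data frame.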

edited Jun 30 '15 at 21:34

answered Jun 30 '15 at 20:51


score 0 · Answer 3 · answered Jun 30 '15 at 21:02

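If you want to drop rows in which columns 2, 5, and 6 all hold the same value within that row (which is not the same thing as dropping rows that repeat another row across those columns), you can try doing this.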

mav[!(mav$v2==mav$v5 & mav$v5==mav$v6),]

Example:

mav <- data.frame(v1=c(1,2,3),v2=c(2,6,4),v3=c(4,5,6),v4=c(1,5,2),v5=c(5,6,7),v6=c(5,6,8),v7=c(7,4,5),v8=c(6,3,1))

mav
  v1 v2 v3 v4 v5 v6 v7 v8
1  1  2  4  1  5  5  7  6
2  2  6  5  5  6  6  4  3
3  3  4  6  2  7  8  5  1

In the data frame above, the 2nd row has the same value (6) in columns v2, v5, and v6.

Do the following.

mav <- mav[!(mav$v2==mav$v5 & mav$v5==mav$v6),]

gives you

mav
  v1 v2 v3 v4 v5 v6 v7 v8
1  1  2  4  1  5  5  7  6
3  3  4  6  2  7  8  5  1

All the other columns are retained.
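To see how this filter relates to removing duplicate rows, the two tests can be run side by side on the same example data. This is only a sketch; the variable names `same_within_row` and `repeats_earlier_row` are made up here.

```r
mav <- data.frame(v1 = c(1, 2, 3), v2 = c(2, 6, 4), v3 = c(4, 5, 6),
                  v4 = c(1, 5, 2), v5 = c(5, 6, 7), v6 = c(5, 6, 8),
                  v7 = c(7, 4, 5), v8 = c(6, 3, 1))

# Within-row test: TRUE where v2, v5 and v6 hold the same value in one row
same_within_row <- mav$v2 == mav$v5 & mav$v5 == mav$v6
same_within_row
# FALSE  TRUE FALSE

# Across-row test: TRUE where a row repeats an earlier row's
# (v2, v5, v6) combination
repeats_earlier_row <- duplicated(mav[, c("v2", "v5", "v6")])
repeats_earlier_row
# FALSE FALSE FALSE
```

On this data the within-row filter drops row 2, whereas a duplicate-row filter would keep all three rows, since no (v2, v5, v6) combination occurs twice.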
