0

I'm traying to Select only the data that is no duplicate in certain columns(ID, position1, position2 and Name). It doesn't matter that the hour be different, that row will be considered duplicate still. The problem is that I need to mantein the column Hour

|    ID | position1 | position2| Hour |   Name
---------------------------------------------------------
| ABD1  |     54340 |     33254| 10:42|   Adam 
| ABD1  |     54340 |     33254| 10:52|   Adam
| TGF5  |     54545 |     54545| 11:25|   Mike
| ABD1  |     84544 |     24424| 12:06|   Adam
| AGF4  |     25466 |     45444| 10:53|   Mike
| TGF5  |     44787 |     25847| 09:05|   Rose

On the example, we can see that the rows 1 and 2 are duplicated on ID, position1, position2 and Name, so I want to eliminate one of them, doesn't matter wich one. I tryng to do this using

df <- sqldf('SELECT DISTINCT ID, position1, position2, Name FROM df')

The problem is that I lost the column Hour. How can I conser the same columns of df, but only eliminate the rows that are equals in certain columns.

This is the result that I need:

|    ID | position1 | position2| Hour |   Name
---------------------------------------------------------
| ABD1  |     54340 |     33254| 10:42|   Adam 
| TGF5  |     54545 |     54545| 11:25|   Mike
| ABD1  |     84544 |     24424| 12:06|   Adam
| AGF4  |     25466 |     45444| 10:53|   Mike
| TGF5  |     44787 |     25847| 09:05|   Rose  
Matias.10
  • 53
  • 6

1 Answers1

0

How about :

df[!duplicated(df[,c(1,2,3,5)]),]

where 1,2,3,5 are the columns you want to be considered.

Example :

> df
    ID  Pos1  Pos2  Hour Name
1 ABD1 54340 33254 10:42 Adam
2 ABD1 54340 33254 10:52 Adam
3 TGF5 54545 54545 11:25 Mike
4 ABD1 84544 24424 12:06 Adam
5 AGF4 25466 45444 10:53 Mike
6 TGF5 44787 25847 09:05 Rose
> 
> 
> df[!duplicated(df[,c(1,2,3,5)]),]
    ID  Pos1  Pos2  Hour Name
1 ABD1 54340 33254 10:42 Adam
3 TGF5 54545 54545 11:25 Mike
4 ABD1 84544 24424 12:06 Adam
5 AGF4 25466 45444 10:53 Mike
6 TGF5 44787 25847 09:05 Rose
> 

Also:

library(sqldf)
sqldf('SELECT * 
       FROM df 
       GROUP BY ID, position1, position2, Name')
Pierre L
  • 28,203
  • 6
  • 47
  • 69
Ram K
  • 1,746
  • 2
  • 14
  • 23
  • This does not address the sticking point, but rather provides a workaround to avoid learning the sql command. – Pierre L Jul 18 '16 at 21:32