Delete duplicate rows from a data frame based on multiple variables

Question

Hello, I'm an engineering student in France and I have a project for university. I would like to remove rows from my data if they have the same values in certain columns. My data looks like this:

node   event   grade   std   date                         groupe                name
6794   57605   100     659   2016-04-08 10:59:45.882267   cm1_mat_001_eap_001   c8
6794   84007   0       659   2016-04-29 13:44:47.156998   cm1_mat_001_eap_001   c8
6794   86729   100     659   2016-05-02 14:17:02.945516   cm1_mat_001_eap_001   c8
6794   88921   100     659   2016-05-04 09:00:52.157544   cm1_mat_001_eap_001   c8
6797   10119   0       659   2016-05-17 08:27:28.371022   cm1_mat_001_eap_001   c8
6794   98291   100     729   2016-05-12 08:27:13.920052   cm1_mat_001_eap_001   c8
6794   99711   100     729   2016-05-13 06:50:13.60001    cm1_mat_001_eap_001   c8
6812   87995   100     796   2016-05-03 07:33:31.108374   cm1_mat_002_eap_003   c8

I would like to remove rows when the values in certain columns are identical. In my case, if the values in the columns "node" AND "std" are both the same as in an earlier row, I would like to remove the duplicate row and keep the first one. The expected result is:

6794   57605   100     659   2016-04-08 10:59:45.882267   cm1_mat_001_eap_001   c8
6797   10119   0       659   2016-05-17 08:27:28.371022   cm1_mat_001_eap_001   c8
6794   98291   100     729   2016-05-12 08:27:13.920052   cm1_mat_001_eap_001   c8
6812   87995   100     796   2016-05-03 07:33:31.108374   cm1_mat_002_eap_003   c8

As you can see, the row with node 6797 remained, because I want to treat rows as duplicates only if both "node" and "std" are the same. In this case the value of "std" equals that of the previous rows, but the value of "node" does not.

Thank you for the help. :)

`install.packages("data.table"); data.table::setDT(df); data.table::setkey(df, node, std); unique(df, by = c("node", "std"))` – Akhil Nair, Jun 29 '16 at 12:58

score 5 · Accepted Answer · answered Jun 29 '16 at 12:20

Using base R,

df[!duplicated(df[c('node', 'std')]),]
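
As a quick check against the sample from the question, here is a minimal sketch. The data frame below is retyped by hand and keeps only four of the columns, so it is an illustration under that assumption and not the asker's actual object.

```r
# Reduced, hand-typed copy of the question's data (assumption: only the
# node, event, grade and std columns are needed to show the idea).
df <- data.frame(
  node  = c(6794, 6794, 6794, 6794, 6797, 6794, 6794, 6812),
  event = c(57605, 84007, 86729, 88921, 10119, 98291, 99711, 87995),
  grade = c(100, 0, 100, 100, 0, 100, 100, 100),
  std   = c(659, 659, 659, 659, 659, 729, 729, 796)
)

# duplicated() on the two-column subset flags every row whose (node, std)
# pair has already appeared in an earlier row.
is_dup <- duplicated(df[c("node", "std")])

# Keep only the first occurrence of each (node, std) pair.
result <- df[!is_dup, ]
print(result)
```

The rows that remain are the ones with events 57605, 10119, 98291 and 87995, which matches the expected result in the question. Because `duplicated()` only marks later repeats, the first row of each pair is the one that is kept.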

answered Jun 29 '16 at 12:20

Sotos

Thanks a lot @Sotos, and sorry for the beginner question. Have a nice day! :D – Sofiane M'barki Jun 29 '16 at 12:25
Just adding a few lines to supplement the answer:

# new_uniq will contain the dataset without the duplicates
new_uniq <- dataset[!duplicated(dataset[c('Date', 'State')]), ]
View(new_uniq)

# Indexes of the duplicate rows that will be removed
duplicate_indexes <- which(duplicated(dataset[c('Date', 'State')]))
duplicate_indexes

– Saurabh Jain Nov 01 '17 at 07:07
