
I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:

platform_external_dbus          202           16                     google        1
platform_external_dbus          202           16         space-ghost.verbum        1
platform_external_dbus          202           16                  localhost        1
platform_external_dbus          202           16          users.sourceforge        8
platform_external_dbus          202           16                    hughsie        1

I would like only one of these rows since the others have the same data in the first column.

Jaap
user1897691
  • which one do you want? just the first? in other words: do you want to keep `google` or `localhost` or `hughsie` ? – Anthony Damico Dec 20 '12 at 07:18
  • It does not matter for this part of my statistical analysis. I am only trying to relate the project title (first column), the number of bugs (second column), and the number of organizations on the project (third column). – user1897691 Dec 20 '12 at 07:20
  • cool. throw unnecessary columns out and use ?unique – Anthony Damico Dec 20 '12 at 07:22

10 Answers


For people who have come here to look for a general answer for duplicate row removal, use !duplicated():

a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

> duplicated(df)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

> df[duplicated(df), ]
  a b
2 A 1
6 B 1
8 C 2

> df[!duplicated(df), ]
  a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2

Answer from: Removing duplicated rows from R data frame

By default this method keeps the first occurrence of each duplicate and drops the rest. Pass the argument fromLast = TRUE to keep the last occurrence instead. You can sort your data before this step so that the rows you want to keep come first.
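For instance, rebuilding the same `df` so the snippet is self-contained, a quick sketch of `fromLast = TRUE`:

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# keep the *last* occurrence of each duplicated row instead of the first
df[!duplicated(df, fromLast = TRUE), ]
#   a b
# 2 A 1
# 3 A 2
# 4 B 4
# 6 B 1
# 8 C 2
```

The surviving row names (2, 6, 8 instead of 1, 5, 7) show that the later copy of each duplicated row is the one kept.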

Gregor Thomas
Mehdi Nellen
  • I want to create a new variable that flags if there's a duplicate *on a certain variable*, almost like df$duplicates <- ifelse(this row's value in column a == previous row's value in column a, 1, 0) – jacob Jul 22 '15 at 09:59
  • @jacob see this question http://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows – dpel May 23 '16 at 15:16
  • This keeps the first appeared value and removes the rest of the duplicates, right? Or does it remove values randomly? – MLE Aug 07 '17 at 00:43
  • @alphabetagamma yes, it keeps the first appeared value – Mehdi Nellen Jan 31 '19 at 09:58
  • If you are only interested in duplicates in certain columns, say cols 1 and 2, we can use `df[!duplicated(df[, 1:2]), ]` – qwr Jun 03 '19 at 15:53
  • this seems more verbose than using unique(). Any reason it has more upvotes? – nateroe Sep 24 '21 at 00:26
  • `df` is the name of a base R function... It would be better to use `my_df` as the name of the data frame – Antoine Jul 24 '22 at 15:03
  • @nateroe `duplicated()` is more general than `unique()`. As qwr's comment shows, with `duplicated` you can specify which columns to check for duplicates. With `unique` that is not so easily done. (With `unique()` you can keep the unique rows for only that set of columns, but it would take more work if you want to bring the other columns along as well.) – Gregor Thomas Dec 14 '22 at 20:51

Just isolate your data frame to the columns you need, then use the unique function :D

# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them, 
# so they're duplicates and thrown out.
Anthony Damico
  • This looks like it will work perfectly. Can you please explain to me what is happening with the `[,1:3]` part of that code? I'm new to R which is why I'm asking what I can only assume is an obvious question. – user1897691 Dec 20 '12 at 07:24
  • @user1897691 mark it as correct then ;) [watch this](http://www.screenr.com/fCs8) and if you like that, check [twotorials.com](http://twotorials.com) – Anthony Damico Dec 20 '12 at 07:25
  • Do note that this will remove all the columns except for the three first ones. – GuillaumeL Apr 01 '19 at 13:47
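If you need the other columns as well, one sketch is to run `duplicated()` on just the key columns and use the result to index the full data frame. The column names below (`project`, `bugs`, `orgs`, `contact`) are made up as a stand-in for `yourdata`:

```r
# toy stand-in for yourdata: duplicates in the first three columns only
yourdata <- data.frame(project = rep("platform_external_dbus", 3),
                       bugs    = rep(202, 3),
                       orgs    = rep(16, 3),
                       contact = c("google", "localhost", "hughsie"))

# deduplicate on columns 1:3 but keep every column in the result
deduped.data <- yourdata[ !duplicated( yourdata[ , 1:3 ] ), ]
deduped.data
#                  project bugs orgs contact
# 1 platform_external_dbus  202   16  google
```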

The function distinct() in the dplyr package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.

Data and package

library(dplyr)
dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))

Remove rows duplicated in a specific column (e.g., column a)

Note that .keep_all = TRUE retains all columns, otherwise only column a would be retained.

distinct(dat, a, .keep_all = TRUE)

  a b
1 1 A
2 2 B

Remove rows that are complete duplicates of other rows:

distinct(dat)

  a b
1 1 A
2 2 B
3 1 C
4 2 D
Sam Firke
  • Great answer, by the way, `.keep_all` is for whether to keep all the columns, not to be confused with `keep` in `pandas`. – Jia Gao Jul 23 '19 at 03:34

The data.table package also has its own unique and duplicated methods with some additional features.

Both the unique.data.table and the duplicated.data.table methods have an additional by argument, which lets you pass a character vector of column names or an integer vector of column locations, respectively.

library(data.table)
DT <- data.table(id = c(1,1,1,2,2,2),
                 val = c(10,20,30,10,20,30))

unique(DT, by = "id")
#    id val
# 1:  1  10
# 2:  2  10

duplicated(DT, by = "id")
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Another important feature of these methods is a huge performance gain on larger data sets:

library(microbenchmark)
library(data.table)
set.seed(123)
DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
DT <- copy(DF)
setDT(DT)

microbenchmark(unique(DF), unique(DT))
# Unit: microseconds
#       expr       min         lq      mean    median        uq       max neval cld
# unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
# unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a 


microbenchmark(duplicated(DF), duplicated(DT))
# Unit: microseconds
#           expr       min         lq       mean     median        uq        max neval cld
# duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
# duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a 
David Arenburg

The general answer can be, for example:

df <- data.frame(rbind(c(2,9,6), c(4,6,7), c(4,6,7), c(4,6,7), c(2,9,6)))



new_df <- df[-which(duplicated(df)), ]

Output:

  X1 X2 X3
1  2  9  6
2  4  6  7
Amit Gupta
  • Be cautious when using `-which`, this will lead to error if there are no duplicates, use `df[!(duplicated(df)), ]` may be safer. – Jia Gao Jan 19 '19 at 17:23
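A minimal sketch of that pitfall: when there are no duplicates, `which()` returns a zero-length vector, and negative indexing with it selects zero rows rather than all of them:

```r
df2 <- data.frame(x = 1:3, y = 4:6)   # no duplicated rows

# which(duplicated(df2)) is integer(0), and -integer(0) is still integer(0),
# so every row is silently dropped
nrow(df2[-which(duplicated(df2)), ])  # 0

# logical indexing does not have this edge case
nrow(df2[!duplicated(df2), ])         # 3
```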

Here's a very simple, fast dplyr/tidy solution:

Remove rows that are entirely the same:

library(dplyr)
iris %>% 
  distinct(.keep_all = TRUE)

Remove rows that are the same only in certain columns:

iris %>% 
  distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)

stevec
  • This is a good approach and can remain in pipes. One note for the first option above, you don't need .keep_all = TRUE, as leaving distinct unqualified, it evaluates the entire data frame to my knowledge. – glenn_in_boston Jun 06 '22 at 15:27

With sqldf:

# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)

Solution:

library(sqldf)
sqldf('SELECT DISTINCT * FROM df')

Output:

  a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2
mpalanco
  • This has the overhead of setting up an entire SQL database. https://cran.r-project.org/web/packages/sqldf/index.html – qwr Nov 29 '18 at 20:14
  • What do you mean by setting up an entire SQL database? That is one of the main advantages: 'with sqldf the user is freed from having to do the following, all of which are automatically done: database setup, writing the create table statement which defines each table, importing and exporting to and from the database'. It is not an optimal solution, but handy for those familiar with SQL. – mpalanco Nov 30 '18 at 07:52
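If you want one row per value of column a (rather than fully distinct rows), a sketch using SQLite's `GROUP BY` with bare columns is below; note that SQLite makes no promise about which row of each group survives, so don't rely on it being the first:

```r
library(sqldf)

a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# one arbitrary row per distinct value of a
sqldf('SELECT * FROM df GROUP BY a')
```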

Remove duplicate rows of a dataframe

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe
distinct(mydata)

In this dataset there is not a single duplicate row, so it returns the same number of rows as mydata.



Remove Duplicate Rows based on a one variable

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe using carb variable
distinct(mydata,carb, .keep_all= TRUE)

The .keep_all argument is used to retain all other variables in the output data frame.



Remove Duplicate Rows based on multiple variables

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe using cyl and vs variables
distinct(mydata, cyl,vs, .keep_all= TRUE)

The .keep_all argument is used to retain all other variables in the output data frame.

(from: http://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function/ )

vasili111

Or you could nest the data in cols 4 and 5 into a single row with tidyr:

library(tidyr)
df %>% nest(V4:V5)

# A tibble: 1 × 4
#                      V1    V2    V3             data
#                  <fctr> <int> <int>           <list>
#1 platform_external_dbus   202    16 <tibble [5 × 2]>

The duplicates in cols 1-3 are now removed for statistical analysis, but you have kept the col 4 and 5 data in a tibble and can go back to the original data frame at any point with unnest().
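A round trip might look like this (a sketch; newer tidyr versions spell it `nest(data = V4:V5)`, and the question's data frame is rebuilt here so the snippet runs standalone):

```r
library(dplyr)
library(tidyr)

df <- data.frame(V1 = "platform_external_dbus", V2 = 202L, V3 = 16L,
                 V4 = c("google", "space-ghost.verbum", "localhost",
                        "users.sourceforge", "hughsie"),
                 V5 = c(1L, 1L, 1L, 8L, 1L))

nested <- df %>% nest(data = V4:V5)  # collapses the five rows to one
unnest(nested, data)                 # restores the original five rows
```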

Joe

This problem can also be solved by selecting the first row from each group, where the groups are defined by the columns on which we want unique values (in the shared example, just the 1st column).

Using base R :

subset(df, ave(V2, V1, FUN = seq_along) == 1)

#                      V1  V2 V3     V4 V5
#1 platform_external_dbus 202 16 google  1

In dplyr

library(dplyr)
df %>% group_by(V1) %>% slice(1L)

Or using data.table

library(data.table)
setDT(df)[, .SD[1L], by = V1]

If we need to find unique rows based on multiple columns, just add those column names to the grouping part of each of the above answers.

data

df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = "platform_external_dbus", class = "factor"), 
V2 = c(202L, 202L, 202L, 202L, 202L), V3 = c(16L, 16L, 16L, 
16L, 16L), V4 = structure(c(1L, 4L, 3L, 5L, 2L), .Label = c("google", 
"hughsie", "localhost", "space-ghost.verbum", "users.sourceforge"
), class = "factor"), V5 = c(1L, 1L, 1L, 8L, 1L)), class = "data.frame", 
row.names = c(NA, -5L))
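For instance, grouping on two columns with dplyr (a sketch using a toy frame `dd` rather than the question's data):

```r
library(dplyr)

dd <- data.frame(g1  = c(1, 1, 2, 2),
                 g2  = c("a", "a", "b", "c"),
                 val = 1:4)

# first row per combination of g1 and g2 -> three rows survive
dd %>% group_by(g1, g2) %>% slice(1L)
```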
Ronak Shah