Removing Duplicates in R

Question

I have data as such: Observations

where a user has a user_id and a bunch of other features like city and country, and is also associated with multiple advertisers. In this format there is a copy of a row for every advertiser that the user is associated with, and the rest of the features of the user are getting copied in every row with that user_id. I want to de-duplicate the user_ids and combine all the advertiser_ids into one set, yet have all the other features stay the same. For example,

I want the 2nd and 3rd rows to be combined and have the advertiser_id column become a set of ids associated with that user, yet have all the other features stay unchanged.

Since you're new to R, I'd advice you to start by reading [this post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as it'll help you a lot in the future. — Arun, Jun 10 '13 at 21:42
Generally making a dataset grow in the "wide" direction is a bad idea in R, especially when the numbers of observations for the "widened" columns are going to be irregular. Most functions graphics and analytic functions are oriented to handling data in the long format. — IRTFM, Jun 10 '13 at 21:58
Possible duplicate of [How to remove duplicates in R](https://stackoverflow.com/questions/11944402/how-to-remove-duplicates-in-r) — Cœur, Dec 25 '18 at 03:13

eddi · Answer 1 · 2013-06-10T21:50:33.900

Here's a data.table solution:

library(data.table)
#example data
dt = data.table(user_id = c(1,2,2,3), advertiser_id = c(1:4), other_data = c(4:1))
#   user_id advertiser_id other_data
#1:       1             1          4
#2:       2             2          3
#3:       2             3          2
#4:       3             4          1

dt[, advertiser_list := list(list(advertiser_id)), by = user_id][
     # ^^^ first collect advertisers into a list by user_id
     !duplicated(user_id)][, # now select the unique users
     advertiser_id := NULL] -> dt # finally remove the advertiser_id column
dt
#   user_id other_data advertiser_list
#1:       1          4               1
#2:       2          3             2,3
#3:       3          1               4

+1, but I think the OPs problem is even more straightforward---I think *everything* except for the "advertiser_id" is duplicated, meaning even the `data.table` solution could probably be simplified. — A5C1D2H2I1M1N2O1R2T1, Jun 11 '13 at 02:35

score 1 · Answer 2 · answered Jun 10 '13 at 21:56

1

The duplicated() function returns a logical vector which is equal to TRUE for duplicated rows. Lets call df your dataset, you will remove all duplicated values with the following line :

df <- subset(df, duplicated(df) = TRUE)

See the R Programming wikibook if you want to learn more about that.

answered Jun 10 '13 at 21:56

PAC

5,178
8
38
62

Don't you mean != instead of = – Crops Dec 17 '14 at 09:51

score 1 · Answer 3 · answered Jun 11 '13 at 02:33

From your description, it sounds like you're just looking for aggregate. Consider the following:

> df = data.frame(user_id = c(1,2,2,3), 
+                 advertiser_id = c(1:4), 
+                 other_data = letters[c(1, 2, 2, 3)])
> df
  user_id advertiser_id other_data
1       1             1          a
2       2             2          b
3       2             3          b
4       3             4          c
> aggregate(advertiser_id ~ . , df, I)
  user_id other_data advertiser_id
1       1          a             1
2       2          b          2, 3
3       3          c             4

The above converts the "advertiser_id" column into a list, as can be inspected using str. This might be convenient, but might also be difficult to work with, for instance if you wanted to save your output to a csv file later on.

> str(aggregate(advertiser_id ~ . , df, I))
'data.frame':   3 obs. of  3 variables:
 $ user_id      : num  1 2 3
 $ other_data   : Factor w/ 3 levels "a","b","c": 1 2 3
 $ advertiser_id:List of 3
  ..$ 0:Class 'AsIs'  int 1
  ..$ 4:Class 'AsIs'  int [1:2] 2 3
  ..$ 8:Class 'AsIs'  int 4

A less flexible alternative is to concatenate the "advertiser_id" columns as a character string.

> aggregate(advertiser_id ~ . , df, paste, collapse = ", ")
  user_id other_data advertiser_id
1       1          a             1
2       2          b          2, 3
3       3          c             4
> str(aggregate(advertiser_id ~ . , df, paste, collapse = ", "))
'data.frame':   3 obs. of  3 variables:
 $ user_id      : num  1 2 3
 $ other_data   : Factor w/ 3 levels "a","b","c": 1 2 3
 $ advertiser_id: chr  "1" "2, 3" "4"

Both of these can also easily be done with data.table, along the lines of @eddi's answer.

score 0 · Answer 4 · answered Jun 10 '13 at 21:38

If you assume that the user data in all other columns is the same, try:

Assume df is your original data.frane:

#pull add ad_id into one column for each user_id
ad = sapply(unique(df$user_id),function(x){paste(df$advertiser_id[df$user_id==x],collapse=",")}
names(ad) = unique(df$user_id)

#Drop all extra rows
df = df[!duplicated(df[,1]),]

#add a column with combined ad_id
df = cbind(df,ad[df$user_id])

Removing Duplicates in R

4 Answers4