1

I have a dataframe that I would like to condense by removing duplicates, but only of a certain variable. In the example below, I would only like to remove duplicates of user_id when the plan_type = subscriber. The output of is shown below of how the sample data should be condensed.

I have tried unique() but it will not work because there may be multiple occurrences of the same user_id where plan_type = PPG and this data should remain.

Any suggestions that do not include multiple steps of subsetting and then rebinding two dataframes?

> foo
      user_id  plan_type
16435    6264 subscriber
31518   10050 subscriber
31520   10050 subscriber
7576    11174 subscriber
19744   11186 subscriber
19745   11186 subscriber
46108   20348 subscriber
5293    31641 subscriber
5294    31641 subscriber
5295    31641        PPU


> output
      user_id  plan_type
16435    6264 subscriber
31520   10050 subscriber
7576    11174 subscriber
19745   11186 subscriber
46108   20348 subscriber
5294    31641 subscriber
5295    31641        PPU

> dput(foo)
structure(list(user_id = c(6264L, 10050L, 10050L, 11174L, 11186L, 
11186L, 20348L, 31641L, 31641L, 31641L), plan_type = c("subscriber", 
"subscriber", "subscriber", "subscriber", "subscriber", "subscriber", 
"subscriber", "subscriber", "subscriber", "PPU")), .Names = c("user_id", 
"plan_type"), row.names = c(16435L, 31518L, 31520L, 7576L, 19744L, 
19745L, 46108L, 5293L, 5294L, 5295L), class = "data.frame")
iskandarblue
  • 7,208
  • 15
  • 60
  • 130
  • Possible duplicate of http://stackoverflow.com/questions/24011246/deleting-rows-that-are-duplicated-in-one-column-based-on-the-conditions-of-anoth – akrun Apr 03 '17 at 13:18

3 Answers3

6

You want to subset only those observations where user_id is not duplicated OR plan_type is not "subscriber":

foo[!duplicated(foo$user_id) | foo$plan_type != "subscriber", ]

Using dplyr, this would be

library(dplyr)
foo %>% filter(!duplicated(user_id) | plan_type != "subscriber")
talat
  • 68,970
  • 21
  • 126
  • 157
  • would just point out, your `dplyr` example keeps the first instance, while given example keeps the last of duplicated instances. This doesn't seem to be very important here, but may matter in certain scenarios. – Aramis7d Apr 03 '17 at 13:16
  • @Aramis7d, OP didn't specify what particular row should be kept in case of duplicates that shall be removed so I'll just stick to my answer unless OP makes other specifications. In the end, it boils down to ordering the data before this step. – talat Apr 03 '17 at 13:21
  • true. I inferred it from his example. Might be a total coincidence. – Aramis7d Apr 03 '17 at 13:23
1

We can create a logic

subset(foo, (!duplicated(user_id) & plan_type == "subscriber")|
         plan_type %in% setdiff(unique(plan_type), "subscriber"))
#       user_id  plan_type
#16435    6264 subscriber
#31518   10050 subscriber
#7576    11174 subscriber
#19744   11186 subscriber
#46108   20348 subscriber
#5293    31641 subscriber
#5295    31641       PPU

Or with data.table

library(data.table)
rbind(unique(setDT(foo), by = "user_id"), foo[plan_type!= "subscriber"])
#    user_id  plan_type
#1:    6264 subscriber
#2:   10050 subscriber
#3:   11174 subscriber
#4:   11186 subscriber
#5:   20348 subscriber
#6:   31641 subscriber
#7:   31641        PPU
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Considering your input as df

df <-  read.table(text = '  indx    user_id  plan_type
16435    6264 subscriber
31518   10050 subscriber
31520   10050 subscriber
7576    11174 subscriber
19744   11186 subscriber
19745   11186 subscriber
46108   20348 subscriber
5293    31641 subscriber
5294    31641 subscriber
5295    31641        PPU', header = T, stringsAsFactors = F)

you can try:

df %>% 
  group_by(plan_type, user_id) %>%
  slice(which.max(indx))

which gives:

Source: local data frame [7 x 3]
Groups: plan_type, user_id [7]

   indx user_id  plan_type
  <int>   <int>      <chr>
1  5295   31641        PPU
2 16435    6264 subscriber
3 31520   10050 subscriber
4  7576   11174 subscriber
5 19745   11186 subscriber
6 46108   20348 subscriber
7  5294   31641 subscriber

you may also include the plan_type == "subscriber" filter if you want to, though it doesn't make a difference per the given example.

This can be done as:

df %>% 
  filter( plan_type == "subscriber") %>%
  group_by(user_id) %>%
  slice(which.max(indx)) %>%
  bind_rows(df %>% filter(plan_type != "subscriber"))
Aramis7d
  • 2,444
  • 19
  • 25
  • 1
    This will fail in OP's real data since they specified `there may be multiple occurrences of the same user_id where plan_type = PPG and this data should remain` – talat Apr 03 '17 at 13:19
  • and I did mention OP can include a filter for that, to fit his actual data :) . I designed my solution considering the example to be a complete representation of the actual set. But of course, if the inputs change, the solutions must change accordingly. – Aramis7d Apr 03 '17 at 13:21
  • I see what you mean, but in this case, OP already specified additional requirements in his post so not taking these into account seems wrong to me – talat Apr 03 '17 at 13:22
  • Fair enough. I guess OP should ideally change the example to include all possible scenarios. Anyways, i made some changes as well :) – Aramis7d Apr 03 '17 at 13:30