
I have a dataframe with multiple variables, and I am interested in how to subset it so that it only includes the first duplicate.

    >head(occurrence)
    userId        occurrence  profile.birthday profile.gender postDate count
    1 100469891698         6               47         Female 583 days     0
    2 100469891698         6               47         Female  55 days     0
    3 100469891698         6               47         Female 481 days     0
    4 100469891698         6               47         Female 583 days     0
    5 100469891698         6               47         Female 583 days     0
    6 100469891698         6               47         Female 583 days     0

Here you can see the dataframe. The 'occurrence' column counts how many times the same userId has occurred. I have tried the following code to remove duplicates:

    occurrence <- occurrence[!duplicated(occurrence$userId),]

However, this removes seemingly "random" duplicates. I want to keep the row that is the oldest by `postDate`. So, for example, the first row should look something like this:

    userId        occurrence  profile.birthday profile.gender postDate count
    1 100469891698         6               47         Female 583 days     0

Thank you for your help!

eagerstudent
    Welcome to Stack Overflow. `duplicated` marks as `TRUE` all occurrences except the first one (there is no randomness). However, your data may not be ordered (decreasingly) by `postDate`, so you'll need to order it before your call, or alternatively group by `userId` and keep only the row with the largest `postDate` in each group (you can find [examples of code to do that in SO Q&A](https://stackoverflow.com/search?q=%5Br%5D+group+by)) – Cath Aug 27 '18 at 12:02

2 Answers

5

Did you try ordering first, like this:

    occurrence <- occurrence[order(occurrence$userId, occurrence$postDate, decreasing=TRUE),]
    occurrenceClean <- occurrence[!duplicated(occurrence$userId),]
    occurrenceClean
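
As a minimal, self-contained sketch of this approach: the data frame `toy` and its values below are made up for illustration, and `postDate` is assumed to be a plain number of days rather than text such as "583 days". Sorting so that the largest `postDate` comes first within each `userId` lets `duplicated()` keep exactly that row:

```r
# Hypothetical toy data; postDate is assumed to be numeric (days).
toy <- data.frame(
  userId   = c(1, 1, 1, 2, 2),
  postDate = c(583, 55, 481, 10, 200),
  count    = c(0, 0, 0, 1, 1)
)

# Sort by userId, then by postDate descending within each user.
toy_sorted <- toy[order(toy$userId, -toy$postDate), ]

# duplicated() flags every repeat after the first occurrence, so the
# first row per userId (the one with the largest postDate) survives.
toy_clean <- toy_sorted[!duplicated(toy_sorted$userId), ]
toy_clean
```

If `postDate` is stored as text, it is probably safer to convert it to a number first, since text sorts alphabetically rather than numerically.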
4

You could use dplyr for this: filter on the maximum `postDate` within each `userId`, then use `distinct()` to remove rows that are exact duplicates. Note that if the rows sharing the maximum `postDate` differ in any other column, you will get all of those records.

    library(dplyr)

    occurrence <- occurrence %>% 
      group_by(userId) %>% 
      filter(postDate == max(postDate)) %>% 
      distinct()

    occurrence
    # A tibble: 1 x 6
    # Groups:   userId [1]
            userId occurrence profile.birthday profile.gender postDate count
             <dbl>      <int>            <int> <chr>          <chr>    <int>
    1 100469891698          6               47 Female         583 days     0
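
If exactly one row per user is wanted even when several rows tie on the maximum `postDate`, one possible variation is `slice_max()` with `with_ties = FALSE` (available in dplyr 1.0.0 and later). The sketch below uses a made-up data frame `toy` and assumes `postDate` is numeric:

```r
library(dplyr)

# Hypothetical toy data; user 1 has two rows tied on the largest postDate.
toy <- data.frame(
  userId   = c(1, 1, 1, 2, 2),
  postDate = c(583, 583, 55, 200, 10),
  count    = c(0, 1, 0, 1, 1)
)

oldest <- toy %>%
  group_by(userId) %>%
  # Keep a single row with the largest postDate per user, even on ties.
  slice_max(postDate, n = 1, with_ties = FALSE) %>%
  ungroup()

oldest
```

Unlike `filter(postDate == max(postDate))`, this returns a single row per `userId` even when the tied rows differ in other columns.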
phiver
    With `dplyr::group_by(occurrence, userId) %>% filter(row_number() == 1)` one can select any record within each group: the value 1 selects the first record, and 2 would select the second. – seakyourpeak Sep 16 '22 at 15:44