Remove "semi-duplicate" rows in R

Question

I have a dataset that looks like this:

  text                 id     screenName retweetCount isRetweet retweeted longitude latitude
1    xx 778980737861062656  0504Traveller            0     FALSE     FALSE      <NA>     <NA>
2    xx 778967536167559168       Iz_Azman            0     FALSE     FALSE      <NA>     <NA>
3    yy 778962265298960384       Iz_Azman            0     FALSE     FALSE      <NA>     <NA>
4    yy 778954988122939392 travelindtoday            2     FALSE     FALSE      <NA>     <NA>
5    zz 778948691969224705           umtn            2     FALSE     FALSE      <NA>     <NA>
6    zz 778942095843135493     flyinsider            0     FALSE     FALSE      <NA>     <NA>

These are tweets from the package twittR in R. Some tweets have exactly the same text but different retweetCount. I want to keep the unique tweets (by text), but keeping those with the highest retweetCount amongst duplicates. (In the case above, tweets 1, 4, and 5.)

How do I do that?

Ronak Shah · Accepted Answer · 2016-09-22T18:35:55.917

You can do this with dplyr

library(dplyr)
df %>%
  group_by(text) %>%
  slice(which.max(retweetCount))

#text           id     screenName retweetCount isRetweet retweeted longitude latitude
#(fctr)        (dbl)         (fctr)        (int)     (lgl)     (lgl)    (fctr)   (fctr)
#1  xx      7.789807e+17  0504Traveller       0     FALSE     FALSE      <NA>     <NA>
#4  yy      7.789550e+17 travelindtoday       2     FALSE     FALSE      <NA>     <NA>
#5  zz      7.789487e+17           umt        2     FALSE     FALSE      <NA>     <NA>

Another approach in base R using aveand order would be :

df[ave(df$retweetCount,df$text, FUN = function(x) order(x, decreasing = T)) == 1, ]

# text           id     screenName retweetCount isRetweet retweeted longitude latitude
#1  xx   7.789807e+17  0504Traveller            0     FALSE     FALSE      <NA>     <NA>
#4  yy   7.789550e+17 travelindtoday            2     FALSE     FALSE      <NA>     <NA>
#5  zz   7.789487e+17           umtn            2     FALSE     FALSE      <NA>     <NA>

Good answer. Can use `arrange` instead of `which.max` if you want more than just the top 1, or `n_tile` if you want to discretize the top percentile of grouped retweets. — shayaa, Sep 22 '16 at 18:13
@shayaa Correct! The first thought which came in my mind was to use `top_n` but even that would return both the rows in `xx` — Ronak Shah, Sep 22 '16 at 18:16

Sandipan Dey · Answer 2 · 2016-09-23T05:23:10.717

How about this?

df

    text           id     screenName retweetCount isRetweet retweeted longitude latitude
1   xx 7.789807e+17  0504Traveller            0     FALSE     FALSE      <NA>     <NA>
2   xx 7.789675e+17       Iz_Azman            0     FALSE     FALSE      <NA>     <NA>
3   yy 7.789623e+17       Iz_Azman            0     FALSE     FALSE      <NA>     <NA>
4   yy 7.789550e+17 travelindtoday            2     FALSE     FALSE      <NA>     <NA>
5   zz 7.789487e+17           umtn            2     FALSE     FALSE      <NA>     <NA>
6   zz 7.789421e+17     flyinsider            0     FALSE     FALSE      <NA>     <NA>

merge(df, aggregate(retweetCount~text, df, FUN=max), by=c('text', 'retweetCount'))        

    text retweetCount           id     screenName isRetweet retweeted longitude latitude
1   xx            0 7.789807e+17  0504Traveller     FALSE     FALSE      <NA>     <NA>
2   xx            0 7.789675e+17       Iz_Azman     FALSE     FALSE      <NA>     <NA>
3   yy            2 7.789550e+17 travelindtoday     FALSE     FALSE      <NA>     <NA>
4   zz            2 7.789487e+17           umtn     FALSE     FALSE      <NA>     <NA>

If we want to remove duplicated retweetCounts, then we can try the following:

merge(df[!duplicated(df[c('text', "retweetCount")]),], 
       aggregate(retweetCount~text, df, FUN=max), by=c('text', 'retweetCount'))

    text retweetCount           id     screenName isRetweet retweeted longitude latitude
1   xx            0 7.789807e+17  0504Traveller     FALSE     FALSE      <NA>     <NA>
2   yy            2 7.789550e+17 travelindtoday     FALSE     FALSE      <NA>     <NA>
3   zz            2 7.789487e+17           umtn     FALSE     FALSE      <NA>     <NA>

Oh I see if that's the requirement, we can change the code a bit, updated above. — Sandipan Dey, Sep 23 '16 at 05:16

Remove "semi-duplicate" rows in R

2 Answers2