1

I have a dataframe where I want to find the biggest pairs based on two columns. However when I group the dataframe, minor variations on other columns affect my result.

Let me tell you:

library(plyr)

usercsv_data <- data.frame(id_str = c("89797", "12387231231", "1234823432", "3483487344", "89797", "1234823432"),
                           screen_name = c("A", "B", "C", "D", "A", "C"),
                           location = c("FL", "CO", "NYC", "MI", "FL", "NYC"),
                           verified = c("Y", "N", "N", "Y", "N", "Y"),
                           created = c("Sun", "Mon", "Tue", "Sun", "Tue", "Fri"),
                           friends_count = c(1,2,5,787,7, 5),
                           followers_count= c(2,4,6,897,4,3))

#         id_str screen_name location verified created friends_count followers_count
# 1       89797           A       FL        Y     Sun             1               2
# 2 12387231231           B       CO        N     Mon             2               4
# 3  1234823432           C      NYC        N     Tue             5               6
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4
# 6  1234823432           C      NYC        Y     Fri             5               3


#This gets me the max pairs when the groups variable are unique
plyr::ddply(usercsv_data,.(id_str,screen_name),numcolwise(max))

#         id_str screen_name friends_count followers_count
# 1  1234823432           C             5               6
# 2 12387231231           B             2               4
# 3  3483487344           D           787             897
# 4       89797           A             7               4


#BUT, when I want to do same technique with whole dataframe, I get same dataframe
plyr::ddply(usercsv_data,.(id_str,screen_name, location,verified,created),numcolwise(max))

#         id_str screen_name location verified created friends_count followers_count
# 1  1234823432           C      NYC        N     Tue             5               6
# 2  1234823432           C      NYC        Y     Fri             5               3
# 3 12387231231           B       CO        N     Mon             2               4
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4
# 6       89797           A       FL        Y     Sun             1               2

But I wanted something like this-

#         id_str screen_name location verified created friends_count followers_count
# 1  1234823432           C      NYC        N     Tue             5               6
# 3 12387231231           B       CO        N     Mon             2               4
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4

How can I group so I maintain all the columns but only keep the rows where largest pairs exist? Currently when there are more group variables, it keeps the unique ones (which is how should be) but not having enough knowledge keeping me search for questions too.

halfer
  • 19,824
  • 17
  • 99
  • 186
CaseebRamos
  • 684
  • 3
  • 18
  • Hi @RonakShah, updated my question. All other values in dataframe will be associated with the highest pair in `friends_count` & `followers_count` columns. I only want to essentially group the `screen_name` and find the rows with largest combination from the counts columns. So rest are from the rows where those largest values preside. – CaseebRamos Sep 22 '20 at 01:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221842/discussion-between-ultapitt-and-ronak-shah). – CaseebRamos Sep 22 '20 at 01:18

1 Answers1

3

plyr has been retired so we can use dplyr here by creating a column which is sum of friends_count and followers_count and then selecting the max row for each id_str and screen_name.

library(dplyr)

usercsv_data %>%
    mutate(max = rowSums(select(., friends_count, followers_count))) %>%
    group_by(id_str, screen_name) %>%
    slice(which.max(max))

Or without creating the max column.

usercsv_data %>%
  group_by(id_str, screen_name) %>%
  slice(which.max(friends_count + followers_count))

#  id_str      screen_name location verified created friends_count followers_count
#  <chr>       <chr>       <chr>    <chr>    <chr>           <dbl>           <dbl>
#1 1234823432  C           NYC      N        Tue                 5               6
#2 12387231231 B           CO       N        Mon                 2               4
#3 3483487344  D           MI       Y        Sun               787             897
#4 89797       A           FL       N        Tue                 7               4
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213