I have a dataframe where I want to find the biggest pairs based on two columns. However when I group the dataframe, minor variations on other columns affect my result.
Let me tell you:
library(plyr)
usercsv_data <- data.frame(id_str = c("89797", "12387231231", "1234823432", "3483487344", "89797", "1234823432"),
screen_name = c("A", "B", "C", "D", "A", "C"),
location = c("FL", "CO", "NYC", "MI", "FL", "NYC"),
verified = c("Y", "N", "N", "Y", "N", "Y"),
created = c("Sun", "Mon", "Tue", "Sun", "Tue", "Fri"),
friends_count = c(1,2,5,787,7, 5),
followers_count= c(2,4,6,897,4,3))
# id_str screen_name location verified created friends_count followers_count
# 1 89797 A FL Y Sun 1 2
# 2 12387231231 B CO N Mon 2 4
# 3 1234823432 C NYC N Tue 5 6
# 4 3483487344 D MI Y Sun 787 897
# 5 89797 A FL N Tue 7 4
# 6 1234823432 C NYC Y Fri 5 3
#This gets me the max pairs when the groups variable are unique
plyr::ddply(usercsv_data,.(id_str,screen_name),numcolwise(max))
# id_str screen_name friends_count followers_count
# 1 1234823432 C 5 6
# 2 12387231231 B 2 4
# 3 3483487344 D 787 897
# 4 89797 A 7 4
#BUT, when I want to do same technique with whole dataframe, I get same dataframe
plyr::ddply(usercsv_data,.(id_str,screen_name, location,verified,created),numcolwise(max))
# id_str screen_name location verified created friends_count followers_count
# 1 1234823432 C NYC N Tue 5 6
# 2 1234823432 C NYC Y Fri 5 3
# 3 12387231231 B CO N Mon 2 4
# 4 3483487344 D MI Y Sun 787 897
# 5 89797 A FL N Tue 7 4
# 6 89797 A FL Y Sun 1 2
But I wanted something like this-
# id_str screen_name location verified created friends_count followers_count
# 1 1234823432 C NYC N Tue 5 6
# 3 12387231231 B CO N Mon 2 4
# 4 3483487344 D MI Y Sun 787 897
# 5 89797 A FL N Tue 7 4
How can I group so I maintain all the columns but only keep the rows where largest pairs exist? Currently when there are more group variables, it keeps the unique ones (which is how should be) but not having enough knowledge keeping me search for questions too.