
Assume we have a data frame in which UserIDs are duplicated, with different names under the same ID; those names can be duplicated as well.

DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

The aim is to aggregate by UserID and Name and calculate the mean and standard deviation over both score columns. A desired output example:

UserID  Name     Class    Scoring_mean  Scoring_std
101     Ed       Junior   12.5          3
101     Hank     Junior   24.67         11.62
102     Sandy    High     24.75         6.29
102     Jessica  Mid      24.25         1.5

Hence my question:

  • What are the options for aggregating the names within each UserID without losing information (e.g. Hank being collapsed into Ed, as happens when grouping on ID alone with summarise() or mutate())?

In my way of thinking, R has to check which Name corresponds to which UserID and, where they match, aggregate and calculate the mean and standard deviation, but I'm not able to get this working with dplyr.
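The "check which Name corresponds to which UserID" step is simply grouping on both columns at once. A minimal sketch in base R (no dplyr), using the same DF as above and stacking the two score columns by hand:

```r
DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

# Stack both score columns into one long column, keeping the ID/Name keys
long <- data.frame(ID = rep(DF$ID, 2), Name = rep(DF$Name, 2),
                   score = c(DF$Scoring, DF$Other_Scores))

# Grouping on ID + Name keeps each Name separate within a UserID
res <- aggregate(score ~ ID + Name, data = long,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
res
```

This yields one row per ID/Name pair (Ed, Hank, Sandy, Jessica), with `res$score` holding a two-column matrix of means and standard deviations.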

At the same time, I couldn't find any other post that fully covers this question; the closest ones are:

  1. How to calculate the mean of specific rows in R?
  2. Subtract pairs of columns based on matching column
  3. Calculating mean when 2 conditions need met in R
  4. average between duplicated rows in R
Wokkel

2 Answers


Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:

DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), 
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)

library(tidyverse)

DF %>%
  gather(score_type, score, Scoring, Other_Scores) %>%  # reshape score columns
  group_by(ID, Name, Class) %>%                         # group by combinations
  summarise(scoring_mean = mean(score),                 # get summary stats
            scoring_sd = sd(score)) %>%
  ungroup()                                             # forget the grouping

# # A tibble: 4 x 5
#       ID Name    Class  scoring_mean scoring_sd
#    <dbl> <fct>   <fct>         <dbl>      <dbl>
# 1  101. Ed      Junior         12.5       3.00
# 2  101. Hank    Junior         24.7      11.6 
# 3  102. Jessica Mid            24.2       1.50
# 4  102. Sandy   High           24.8       6.29
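For newer versions of the tidyverse, the same reshaping can be written with `pivot_longer()`, which supersedes `gather()`. A sketch assuming tidyr >= 1.0 and dplyr >= 1.0 (for the `.groups` argument):

```r
library(tidyverse)

DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

res <- DF %>%
  pivot_longer(c(Scoring, Other_Scores),            # reshape score columns
               names_to = "score_type", values_to = "score") %>%
  group_by(ID, Name, Class) %>%                     # group by combinations
  summarise(scoring_mean = mean(score),             # get summary stats
            scoring_sd   = sd(score),
            .groups = "drop")                       # drop grouping afterwards
res
```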
AntoniosK

What about computing your summary stats and then joining the results back to your initial data frame? Like so:

library(dplyr)

DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

DF2 <- DF %>%
  group_by(Name) %>%
  summarise(scoring_mean = mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[, c("ID", "Name", "Class")], by = "Name")

Giving:

# A tibble: 9 x 5
  Name    scoring_mean scoring_sd    ID Class 
  <fct>          <dbl>      <dbl> <dbl> <fct> 
1 Ed              13.0      2.83   101. Junior
2 Ed              13.0      2.83   101. Junior
3 Hank            16.0      3.46   101. Junior
4 Hank            16.0      3.46   101. Junior
5 Hank            16.0      3.46   101. Junior
6 Jessica         25.5      0.707  102. Mid   
7 Jessica         25.5      0.707  102. Mid   
8 Sandy           21.0      1.41   102. High  
9 Sandy           21.0      1.41   102. High 
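Note this groups on Name alone and summarises only the Scoring column, so the numbers differ from the desired output in the question. To reproduce that output with the same join idea, the grouping would need both key columns and both score columns; a variant sketch, assuming tidyr >= 1.0:

```r
library(dplyr)
library(tidyr)

DF <- data.frame(ID = c(101,101,101,101,101,102,102,102,102),
                 Name = c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class = c("Junior","Junior","Junior","Junior","Junior","High","High","Mid","Mid"),
                 Scoring = c(11,15,18,18,12,20,22,25,26),
                 Other_Scores = c(15,9,34,23,43,23,34,23,23))

# Summarise over both score columns per ID/Name pair
stats <- DF %>%
  pivot_longer(c(Scoring, Other_Scores), values_to = "score") %>%
  group_by(ID, Name) %>%
  summarise(scoring_mean = mean(score), scoring_sd = sd(score),
            .groups = "drop")

# Re-attach Class via a join on the distinct keys, so rows aren't duplicated
res <- stats %>%
  left_join(distinct(DF, ID, Name, Class), by = c("ID", "Name"))
res
```

Using `distinct()` on the key columns before the join is what keeps the result at one row per ID/Name pair instead of nine.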
Luke Hayden
  • This answer is not appropriate in this kind of situation, since it does not keep the observations separate per ID/Name pair, which is what I needed in the first place. Maybe my question was not clear in the post, so my apologies. – Wokkel Apr 01 '18 at 09:11