Aggregate data in dataframe

Question

I have the following data frame in R

DeptNumber  EmployeeTypeId
         1              10
         1              11
         1              11
         2              23
         2              23
         2              30
         2              40
         3              45
         3              46

I need to generate another dataframe with a new column MaxEmployeeType, which should contain the EmployeeTypeId which is repeated the most for a given DeptNumber. The output should be as follows

DeptNumber  MaxEmployeeType 
         1              11
         2              23
         3              45

In case of departmentNumber=3, there is a tie, but it is ok to present either of the option. I am not sure what is the optimal way to do this? Any help is appreciated.

A similar question is posted already

How to aggregate data in R with mode (most common) value for each row?

but it had a limitation to use only plyr & lubridate. If possible I want a best solution and not limit to these two packages. The question is even down voted possibly due to that it could be homework.

`aggregate(EmployeeTypeId ~ ., dd, function(x) x[duplicated(x) + (sum(duplicated(x)) == 0)][1])` — rawr, Oct 30 '15 at 03:01
or `aggregate(EmployeeTypeId ~ ., dd, function(x) if (length(y <- x[duplicated(x)])) y else x[1])` — rawr, Oct 30 '15 at 03:07

Steven Beaupré · Accepted Answer · 2015-10-30T03:05:36.033

4

You could try:

library(dplyr)
df %>% 
  count(DeptNumber, EmployeeTypeId) %>% 
  top_n(1) %>% 
  slice(1)

Or as per suggested by @jazzuro:

count(df, DeptNumber, EmployeeTypeId) %>% slice(which(n == max(n))[1])

Which gives:

#Source: local data frame [3 x 3]
#Groups: DeptNumber [3]
#
#  DeptNumber EmployeeTypeId     n
#       (int)          (int) (int)
#1          1             11     2
#2          2             23     2
#3          3             45     1

edited Oct 30 '15 at 03:05

answered Oct 30 '15 at 03:02

Steven Beaupré

21,343
7
57
77

1

In a similar way, I had `count(df, DeptNumber, EmployeeTypeId) %>% slice(which(n == max(n))[1])` – jazzurro Oct 30 '15 at 03:04

narendra-choudhary · Answer 2 · 2015-10-30T03:30:10.663

Try this.

# Mode function
Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}
# new data-frame
new_df <- data.frame("DeptNumber" = numeric(0), "MaxEmployeeType" = numeric(0))
# distinct departments
depts <- unique(df$DeptNumber)
# calculate mode for every department
for(dept in depts){
    dept_set <- subset(df, DeptNumber == dept)
    new_df <- rbind(new_df, c(dept, Mode(dept_set$EmployeeTypeId)))
}

R doesn't have any standard function for calculating Mode. Mode function in the code above is taken from Ken Williams' post here.

bramtayl · Answer 3 · 2015-10-31T15:18:31.063

1

Here is another dplyr solution

library(dplyr)

data %>%
  count(DeptNumber, EmployeeTypeId) %>%
  slice(which.max(n))

edited Oct 31 '15 at 15:18

answered Oct 30 '15 at 03:02

bramtayl

4,004
2
11
18

And instead of `group_by()` and `tally()` we could use `count()`: `count(df, DeptNumber, EmployeeTypeId) %>% slice(which.max(n))` – Steven Beaupré Oct 30 '15 at 21:13

Aggregate data in dataframe

3 Answers3