How to aggregate data in data.frame in R

Question

I have a large data.frame. The data.frame include a lot of values.

For example:

df <- data.frame(Company = c('A', 'A', 'B', 'C', 'A', 'B', 'B', 'C', 'C'), 
                 Name = c("Wayne", "Duane", "William", "Rafael", "John", "Eric", "James", "Pablo", "Tammy"), 
                 Age = c(26, 27, 28, 32, 28, 24, 34, 30, 25), 
                 Wages = c(50000, 70000, 70000, 60000, 50000, 70000, 65000, 50000, 50000), 
                 Education.University = c(1, 1, 1, 0, 0, 1, 1, 0, 1), 
                 Productivity = c(100, 120, 120, 95, 88, 115, 100, 90, 120))

How can I aggregate my data.frame? I want to analyze values on every Company. It must look like:

Age -> average Age of all employees in Company

Wages -> average Wages of all employees in Company

Education.University -> sum of factors (1 or 0) for all employees in Company

Productivity -> average Productivity of all employees in Company

Possible duplicate of [How can I apply different aggregate functions to different columns in R?](https://stackoverflow.com/questions/10702708/how-can-i-apply-different-aggregate-functions-to-different-columns-in-r) — acylam, Aug 15 '17 at 19:27

d.b · Answer 1 · 2017-08-15T19:35:46.687

Base R

cbind(aggregate(.~Company, df[,-c(2, 5)], mean),
      aggregate(Education.University~Company, df, sum)[-1])
#  Company      Age    Wages Productivity Education.University
#1       A 27.00000 56666.67     102.6667                    2
#2       B 28.66667 68333.33     111.6667                    3
#3       C 29.00000 53333.33     101.6667                    1

Here is the longer version that may be easier to understand

merge(x = aggregate(x = list(Age_av = df$Age,
                             Wages_av = df$Wages,
                             Productivity_av = df$Productivity),
                by = list(Company = df$Company),
                FUN = mean),
      y = aggregate(x = list(Education.University_sum = df$Education.University),
                by = list(Company = df$Company),
                FUN = sum),
      by = "Company")
#  Company   Age_av Wages_av Productivity_av Education.University_sum
#1       A 27.00000 56666.67        102.6667                        2
#2       B 28.66667 68333.33        111.6667                        3
#3       C 29.00000 53333.33        101.6667                        1

akrun · Answer 2 · 2017-08-15T19:28:26.087

One option is using data.table

library(data.table)
setDT(df)[, c(lapply(.SD[, c(2:3, 5), with = FALSE], mean), 
    .(Education.University = sum(Education.University))), by = Company]
#   Company      Age    Wages Productivity Education.University
#1:       A 27.00000 56666.67     102.6667                    2
#2:       B 28.66667 68333.33     111.6667                    3
#3:       C 29.00000 53333.33     101.6667                    1

Or with dplyr

library(dplyr)
df %>%
   group_by(Company) %>% 
   mutate(Education.University = sum(Education.University)) %>%
   summarise_if(is.numeric, mean)
# A tibble: 3 x 5
#  Company      Age    Wages Education.University Productivity
#   <fctr>    <dbl>    <dbl>                <dbl>        <dbl>
#1       A 27.00000 56666.67                    2     102.6667
#2       B 28.66667 68333.33                    3     111.6667
#3       C 29.00000 53333.33                    1     101.6667

@akrun thank you! data.table works. is it possible in `dplyr` group_by more values, for example, `Company` and `Name`? — Denis, Aug 15 '17 at 19:34

score 3 · Answer 3 · answered Aug 15 '17 at 19:35

3

You can easily do it by using dplyr library.

library(dplyr)
df %>% group_by(Company) %>% summarise(Age = mean(Age), Wages = mean(Wages), Education.University = sum(Education.University), Productivity = mean(Productivity))

answered Aug 15 '17 at 19:35

Ismail Zekhnini

31
1

score 2 · Answer 4 · edited Jun 20 '20 at 09:12

The concise data.table solution already posted is using column numbers instead of column names. This is considered bad practice according to Frequently Asked Questions about data.table, section 1.1:

If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5.

So, I would like to propose alternative approaches which use column names.

Variant 1

library(data.table)
setDT(df)[, .(average.Age = mean(Age), 
              average.Wages = mean(Wages), 
              sum.Education.University = sum(Education.University),
              average.Productivity = mean(Productivity)), 
          by = Company]

   Company average.Age average.Wages sum.Education.University average.Productivity
1:       A    27.00000      56666.67                        2             102.6667
2:       B    28.66667      68333.33                        3             111.6667
3:       C    29.00000      53333.33                        1             101.6667

Here, every column is aggregated separately. Although it requires more of typing, it has several benefits:

It is easy to understand what the code snippet is intended to do.
The column names in the result can be amended as required.
The order of columns in the result can be controlled if required.

Variant 2

If there are many columns which require the same operations, the data.table FAQ recommends to use .SDcols. So, we can do

m_cols <- c("Age", "Wages", "Productivity")
s_cols <- c("Education.University")
by_cols <- c("Company") 
setDT(df)[, c(.SD[, lapply(.SD, mean), .SDcols = m_cols],
              .SD[, lapply(.SD, sum ), .SDcols = s_cols]),
          by = by_cols]

   Company      Age    Wages Productivity Education.University
1:       A 27.00000 56666.67     102.6667                    2
2:       B 28.66667 68333.33     111.6667                    3
3:       C 29.00000 53333.33     101.6667                    1

This is similar to Akrun's answer but uses column names instead of column numbers. In addition, the column names are stored in a variable which is handy for programming.

Note that by_cols may contain additional columns for aggregation, .e.g,

by_cols <- c("Company", "Name")

If column order matters, we can use setcolorder():

result <- setDT(df)[, c(.SD[, lapply(.SD, mean), .SDcols = m_cols],
                        .SD[, lapply(.SD, sum ), .SDcols = s_cols]),
                    by = by_cols]
setcolorder(result, intersect(names(df), names(result)))
result

   Company      Age    Wages Education.University Productivity
1:       A 27.00000 56666.67                    2     102.6667
2:       B 28.66667 68333.33                    3     111.6667
3:       C 29.00000 53333.33                    1     101.6667

Likewise, the column names of the result can be amended to meet OP's requirements:

setnames(result, m_cols, paste0("average.", m_cols))
setnames(result, s_cols, paste0("sum.", s_cols))
result

   Company average.Age average.Wages sum.Education.University average.Productivity
1:       A    27.00000      56666.67                        2             102.6667
2:       B    28.66667      68333.33                        3             111.6667
3:       C    29.00000      53333.33                        1             101.6667

Note that the data.table functions setcolorder() and setnames() work in place, i.e., without copying the data.table object. This saves memory and time which is of particular importantance when dealing with large tables.

score 2 · Answer 5 · answered Jan 23 '20 at 17:41

2

Just use the "aggregate" function

aggregate(x = df[c("Age","Wages","Education.University","Productivity")], by = df[c("Company")], FUN = mean)

#  Company      Age    Wages Education.University Productivity
#1       A 27.00000 56666.67            0.6666667     102.6667
#2       B 28.66667 68333.33            1.0000000     111.6667
#3       C 29.00000 53333.33            0.3333333     101.6667

answered Jan 23 '20 at 17:41

Brutalroot

311
3
8

OP requires sum for education university and averages for the rest. – zalimgandhera Jan 23 '20 at 17:56
1

@Brutalroot If I wanted add multiple statistics at once could I do that? I want to get mean,min,max, sd, Iqr, Mscore and a few others ignoring NA and make a new df for each statistics – Spooked Oct 23 '20 at 22:01
@user3614197 check this toppic. 'https://stackoverflow.com/questions/12064202/apply-several-summary-functions-on-several-variables-by-group-in-one-call' – Brutalroot Oct 26 '20 at 13:51

How to aggregate data in data.frame in R

5 Answers5

Variant 1

Variant 2

Linked

Related