0

How to aggregate at different level on a data.farme.

Example:

Data:

> dput(sampledata)
structure(list(city = c("a", "a", "b", "b", "c", "c"), workerID = c("1", 
"2", "3", "4", "5", "6"), salary = c(50000, 1e+05, 60000, 3e+05, 
40000, 80000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

A solution:

left_join(sampledata, 
            sampledata %>%
              group_by(city) %>%
              summarise(city_level_mean = mean(salary)))

Output:

Joining, by = "city"
# A tibble: 6 × 4
  city  workerID salary city_level_mean
  <chr> <chr>     <dbl>           <dbl>
1 a     1         50000           75000
2 a     2        100000           75000
3 b     3         60000          180000
4 b     4        300000          180000
5 c     5         40000           60000
6 c     6         80000           60000

Any other solution?

Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
Iman
  • 2,224
  • 15
  • 35

1 Answers1

3

A subtlety in Iman's question is whether there is a technique that can be employed that creates the aggregated data column city_level_mean without having to re-join the summarized data with the original data frame.

If one uses dplyr::mutate() instead of summarise() the data can be aggregated without a left_join.

df <- structure(list(city = c("a", "a", "b", "b", "c", "c"), workerID = c("1", "2", "3", "4", "5", "6"), salary = c(50000, 1e+05, 60000, 3e+05, 40000, 80000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

library(dplyr)

df %>% 
     group_by(city) %>%
     mutate(city_level_mean = mean(salary))

...and the output:

> df %>% 
+      group_by(city) %>%
+      mutate(city_level_mean = mean(salary))
# A tibble: 6 x 4
# Groups:   city [3]
  city  workerID salary city_level_mean
  <chr> <chr>     <dbl>           <dbl>
1 a     1         50000           75000
2 a     2        100000           75000
3 b     3         60000          180000
4 b     4        300000          180000
5 c     5         40000           60000
6 c     6         80000           60000
> 

The same operation using the data.table package looks like this. We create a new column, city_level_mean as the mean of the salary column, using city as a by group.

# data.table approach
library(data.table)
df_dt <- as.data.table(df)
df_dt[,city_level_mean := mean(salary),city]
df_dt

...and the output:

> df_dt <- as.data.table(df)
> df_dt[,city_level_mean := mean(salary),city]
> df_dt
   city workerID salary city_level_mean
1:    a        1  5e+04           75000
2:    a        2  1e+05           75000
3:    b        3  6e+04          180000
4:    b        4  3e+05          180000
5:    c        5  4e+04           60000
6:    c        6  8e+04           60000
Len Greski
  • 10,505
  • 2
  • 22
  • 33