I am having trouble applying a split to a data.frame and then assembling the aggregated results back into a different data.frame. I tried the 'unsplit' function but can't figure out how to use it to get the desired result. Let me demonstrate on the common 'mtcars' data. Say the result I want is a data frame with two variables: cyl (cylinders) and mean_mpg (the mean of mpg over the cars sharing the same cylinder count).

So the initial split goes like this:

spl <- split(mtcars, mtcars$cyl)

The result of which looks something like this:

$`4`
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
...

$`6`
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
...

$`8`
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
...

Now I want to do something along the lines of:

df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))

However, doing the above results in:

            X4       X6   X8
1 26.66364 19.74286 15.1

While I'd want the df to be like this:

  cyl mean_mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000

Thanks, J.

Jaroslav

1 Answer

If you are only interested in reassembling a split, look at (2), (4) and (4a). If the underlying question is really about how to aggregate over groups, all of them may be of interest:

1) aggregate. Normally one uses aggregate, as already mentioned in the comments. Simplifying @alistaire's code slightly:

aggregate(mpg ~ cyl, mtcars, mean)
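
The aggregate call above names the result column mpg, whereas the question asks for mean_mpg. One possible way to get the requested names, sketched here with base R's setNames (the variable name res is only illustrative):

```r
# Base R only; mtcars ships with R in the datasets package.
res <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Rename the columns to the names used in the question.
res <- setNames(res, c("cyl", "mean_mpg"))
res
```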

2) split/lapply/do.call. @rawr has also given a split/lapply/do.call solution in the comments, which we can simplify slightly as well:

spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
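
If the goal is to stay close to the lapply attempt in the question, a variant that may be easier to read is to compute one mean per group and then build the data frame from the group names. This is only a sketch: the names means and mean_by_cyl are illustrative, and it assumes the split names can be converted back to numbers, which holds for cyl.

```r
spl <- split(mtcars, mtcars$cyl)

# One mean per group; vapply returns a named numeric vector
# whose names are the group labels "4", "6" and "8".
means <- vapply(spl, function(x) mean(x$mpg), numeric(1))

# The names become the cyl column, the values become mean_mpg.
mean_by_cyl <- data.frame(cyl = as.numeric(names(means)),
                          mean_mpg = unname(means))
mean_by_cyl
```

The original attempt failed because as.data.frame treats each list element as a column, so the three groups became three columns rather than three rows.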

3) do.call/by. The last one could alternatively be rewritten in terms of by:

do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))

4) split/lapply/unsplit. Another possibility is to use split and unsplit:

spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))

4a) Or, if row names are sufficient:

spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
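
For context on why (4) and (4a) pass a second argument to unsplit: it expects the grouping vector that says which group each row of the result belongs to. A minimal round-trip sketch (the name restored is illustrative) shows the ordinary use, where that vector is the same one given to split:

```r
spl <- split(mtcars, mtcars$cyl)

# Passing the original grouping vector puts every row back
# in its original position.
restored <- unsplit(spl, mtcars$cyl)

# Compare the reassembled data frame with the original.
all.equal(restored, mtcars)
```

In (4) and (4a) each piece has been reduced to a single row, so the grouping vector has one entry per group instead of one entry per car.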

None of the above uses any packages, but many packages can also do aggregations, including dplyr, data.table and sqldf:

5) dplyr

library(dplyr)
mtcars %>%
       group_by(cyl) %>%
       summarize(mpg = mean(mpg)) %>%
       ungroup()

6) data.table

library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]

7) sqldf

library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")
G. Grothendieck