2

Say I have a dataframe D1 with 4 columns:

  • F1 contains factors [A, B]
  • F2 contains factors [P, Q]
  • F3 contains factors [X, Y]
  • F4 contains numeric values

How can I transform this into a new three columned dataframe, D2:

  • F1 and F2 as before
  • F3 contains the mean of all values in previous rows averaged over all levels of [X, Y, Z] and store this in a new dataframe.

I know how to get the mean per single factor level, but I would like to do this for the cross product of two other levels.

example given the dataframe:

F1  F2  F3  F4
 A   P   X   2
 A   P   Y   4
 A   Q   X   3
 A   Q   Y   5
 B   P   X   1.5
 B   P   Y   2.5
 B   Q   X   0
 B   Q   Y   1

the following dataframe would be outputed:

F1  F2  F3
 A   P   3
 A   Q   4
 B   P   2
 B   Q   0.5

Ideally this should work irrespective of the number of levels of factors invovled

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
Tom
  • 6,601
  • 12
  • 40
  • 48

2 Answers2

2

We can try

library(data.table)
setDT(df1)[, list(F3=mean(F4)) , .(F1, F2)]
#    F1 F2  F3
#1:  A  P 3.0
#2:  A  Q 4.0
#3:  B  P 2.0
#4:  B  Q 0.5

library(dplyr)
df1 %>%
    group_by(F1, F2) %>%
    summarise(F3= mean(F4))
#  F1    F2    F3
#  (chr) (chr) (dbl)
#1     A     P   3.0
#2     A     Q   4.0
#3     B     P   2.0
#4     B     Q   0.5

aggregate(F4~F1+F2, df1, mean)
akrun
  • 874,273
  • 37
  • 540
  • 662
0

A potential solution with library dplyr is:

library(dplyr)
D1 <- data.frame(F1 = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 F2 = c('P', 'P', 'Q', 'Q', 'P', 'P', 'Q', 'Q'),
                 F3 = c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
                 F4 = c(2, 4, 3, 5, 1.5, 2.5, 0, 1),
                 stringsAsFactors = FALSE)
D2 <- D1 %>% group_by(F1, F2) %>% summarise(F3 = mean(F4))

This will replace column F3 with the mean.

The output is as follows:

D2
Source: local data frame [4 x 3]
Groups: F1 [?]

     F1    F2    F3
  (chr) (chr) (dbl)
1     A     P   3.0
2     A     Q   4.0
3     B     P   2.0
4     B     Q   0.5
Gopala
  • 10,363
  • 7
  • 45
  • 77
  • 1
    I meant that you are working with the original data.frame, while their output is an aggregated one, with only four rows. Oh, I see you've switched to `summarise`, but then this repeats part of akrun's answer. – Frank Jan 11 '16 at 17:50
  • This is the same as my solution posted several mins earlier – akrun Jan 11 '16 at 17:54
  • His was first based on data table. Mine was dplyr. He just added dplyr answer. – Gopala Jan 11 '16 at 17:54
  • 4
    I added dplyr solution before you posted the solution – akrun Jan 11 '16 at 17:54