
In R, I want to split a data frame along a factor variable, and then apply a function to the data pertaining to each level of that variable. I want to do all of this inside my function. Somehow, the data aren't being split?

I don't understand all of the nuances of passing arguments to functions nested within other functions. I had originally tried to do this with dplyr, but was unable to pass the arguments to dplyr nested within my function.

Here's my function:

    myFun <- function(dat, strat.var, PSU, var1){
        strata <- as.character(unique(dat[, strat.var]))
        N.h <- length(strata)
        sdat <- with(dat, split(dat, strat.var))
        fun1 <- function(x){ length(unique(x[, PSU])) }
        fun2 <- function(x){ sum(tapply(x[, var1], x[, PSU], mean)) }
        ns <- sapply(sdat, fun1)
        mns <- sapply(sdat, fun2)
        dfx <- data.frame(cbind(stratum=strata, ns=ns, mns=mns))
        return(list(N.h = N.h, out=dfx))
    }

To demonstrate I use the warpbreaks data, but my actual data set has 8 levels of "strat.var" and nested within those are between 2 and 10 levels of "PSU".

    myFun(dat=warpbreaks, strat.var="wool", PSU="tension", var1="breaks")
    # $N.h
    # [1] 2

    # $out
    #   stratum ns              mns
    # 1       A  3 84.4444444444444
    # 2       B  3 84.4444444444444

But this isn't correct, because doing it by hand I get:

    sdat <- with(warpbreaks, split(warpbreaks, wool))
    fun1 <- function(x, PSU){ length(unique(x[, PSU])) }
    fun2 <- function(x, PSU, var1){ sum(tapply(x[, var1], x[, PSU], mean)) }
    sapply(sdat, fun1, PSU="tension")
    # A B
    # 3 3
    sapply(sdat, fun2, PSU="tension", var1="breaks")
    #        A        B
    # 93.11111 75.77778

I'm using sapply() because of posts like this one and this one. I'm not using subset() because I couldn't get it to work. I'm also open to any suggestions using dplyr.

Thanks in advance for any and all help!

NotYourIPhone Siri
  • Is the variable you're trying to compute the (numerical) mean of a factor (like warpbreaks$tension), or numerical? You can compute the numerical mean of a factor's levels, but it doesn't mean anything. – smci Mar 06 '15 at 05:37
  • `as.character(unique(dat[, strat.var]))` is just an obfuscation for `labels(dat[, strat.var])` on your group_by variable. And the rest is obfuscated `group_by` and `summarize(newvar = mean(var))` – smci Mar 06 '15 at 05:38
  • @smci, when I use `labels(dat[, strat.var])` I get a vector of 1:54, which is `nrow(warpbreaks)`. What am I missing? – NotYourIPhone Siri Mar 06 '15 at 17:58
  • Doh! I meant `levels(dat[, strat.var])` Levels, not labels. – smci Mar 07 '15 at 01:19
  • I started implementing this in dplyr, but wanted you to confirm what you are doing, it does not seem to make any physical sense: first you `group_by(strat.var)`, then you hierarchically do another `group_by(PSU)`, and `summarize(mns = mean(var1))`, then you ungroup (just the split on PSU) and summarize with the sum of those individual means, then you ungroup again. Correct? – smci Mar 07 '15 at 13:34

1 Answer

You can replace

 sdat <- with(dat, split(dat, strat.var))

with

sdat <- split(dat, dat[strat.var])

in `myFun`.

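The original code did not split the data as intended. Inside `with(dat, ...)`, `strat.var` is not a column of `dat`, so it resolves to the function argument, the string "wool". `split()` recycles that single value across all rows and returns one group containing the whole data set. As a result you were getting the sum for the whole data, i.e.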

sum(with(warpbreaks, tapply(breaks, tension, FUN=mean)))
#[1] 84.44444
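
To make the difference concrete, here is a minimal base R sketch on the built-in warpbreaks data. The object names `bad` and `good` are only illustrative and are not part of the original function.

```r
strat.var <- "wool"

# What the original line effectively computes: the grouping value is the
# length-one string "wool", which split() recycles across all rows.
bad <- split(warpbreaks, strat.var)
names(bad)       # a single group, named after the string itself
nrow(bad[[1]])   # every row of warpbreaks lands in that one group

# Indexing the data frame with the string picks up the actual column.
good <- split(warpbreaks, warpbreaks[[strat.var]])
names(good)          # one group per level of wool
sapply(good, nrow)   # rows per group
```

`dat[strat.var]` (a one-column data frame) and `dat[[strat.var]]` (the column itself) both work as the grouping argument here.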

Using the corrected `myFun`:

myFun(warpbreaks, strat.var='wool', PSU='tension', var1='breaks')
#$N.h
#[1] 2

#$out
#  stratum ns              mns
#A       A  3 93.1111111111111
#B       B  3 75.7777777777778
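
Note that `ns` and `mns` print as long strings in the output above, because `cbind()` builds a character matrix before `data.frame()` sees the values. As a hedged sketch (the name `myFun_fixed` is hypothetical, and this is one possible tidy-up rather than the only fix), the function could build the data frame directly so the columns stay numeric:

```r
# Hypothetical tidied version; same arguments as myFun in the question.
myFun_fixed <- function(dat, strat.var, PSU, var1) {
  # Look the stratum column up by the name stored in strat.var
  sdat <- split(dat, dat[[strat.var]])
  # Number of distinct PSUs within each stratum
  ns <- sapply(sdat, function(x) length(unique(x[[PSU]])))
  # Sum of the per-PSU means of var1 within each stratum
  mns <- sapply(sdat, function(x) sum(tapply(x[[var1]], x[[PSU]], mean)))
  # data.frame() without cbind() keeps ns and mns numeric
  out <- data.frame(stratum = names(sdat), ns = ns, mns = mns,
                    stringsAsFactors = FALSE)
  list(N.h = length(sdat), out = out)
}

res <- myFun_fixed(warpbreaks, "wool", "tension", "breaks")
res$out
```

With the columns kept numeric, `res$out$mns` can be used in later arithmetic without conversion.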

You could also write the function with dplyr (the version below can be fine-tuned):

library(lazyeval)
library(dplyr)
myFun2 <- function(dat, strat.var, PSU, var1) {
   dat %>%
      mutate_(N.h = interp(~n_distinct(var),
               var = as.name(strat.var))) %>% 
      group_by_(.dots=strat.var) %>%
      mutate_(ns = interp(~n_distinct(var), var=as.name(PSU))) %>% 
      group_by_(.dots=PSU, add=TRUE) %>% 
      mutate_(mns=interp(~mean(var), var=as.name(var1))) %>%  
      select_(.dots= list(strat.var, 'ns', 'N.h', 'mns')) %>%
      unique() %>%
      group_by_(.dots=strat.var, 'ns', 'N.h') %>% 
      summarise(mns=sum(mns))                  
 }

myFun2(warpbreaks, 'wool', 'tension', 'breaks')
#Source: local data frame [2 x 4]
#Groups: ns, N.h

#  ns N.h wool      mns
#1  3   2    A 93.11111
#2  3   2    B 75.77778
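
The underscore verbs (`mutate_`, `group_by_`, `select_`) and lazyeval used above have since been deprecated in dplyr. As a rough sketch only, assuming dplyr 1.0 or later, the same two-stage summary can be written with `across(all_of(...))` and the `.data` pronoun. The names `myFun3` and `psu_mean` are hypothetical.

```r
library(dplyr)

# Hypothetical rewrite for dplyr >= 1.0; column names are passed as strings.
myFun3 <- function(dat, strat.var, PSU, var1) {
  dat %>%
    # Stage 1: one row per stratum/PSU combination, holding the PSU mean
    group_by(across(all_of(c(strat.var, PSU)))) %>%
    summarise(psu_mean = mean(.data[[var1]]), .groups = "drop") %>%
    # Stage 2: per stratum, count the PSUs and sum their means
    group_by(across(all_of(strat.var))) %>%
    summarise(ns = n(), mns = sum(psu_mean), .groups = "drop") %>%
    # Number of strata, repeated on every row
    mutate(N.h = n())
}

res3 <- myFun3(warpbreaks, "wool", "tension", "breaks")
res3
```

This returns one row per stratum with the columns `ns`, `mns`, and `N.h`, in a different column order from `myFun2`.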
akrun
  • The dplyr implementation is much cleaner than that, but first we need the OP to confirm this is really what they want to do, since it doesn't make physical sense to sum means calculated across a split by levels of a factor. – smci Mar 07 '15 at 13:36
  • @smci It could be, I was just giving some ideas to the OP. The main problem seems to be fixing his function. – akrun Mar 07 '15 at 14:57