I would like to calculate the most frequent factor level by category with plyr using the code below. The data frame b shows the desired result. Why does c$mlevels only contain the value "numeric"?

require(plyr)
set.seed(0)
a <- data.frame(cat=round(runif(100, 1, 3)),
                levels=factor(round(runif(100, 1, 10))))
mode <- function(x) names(table(x))[which.max(table(x))]
b <- data.frame(cat=1:3,
                mlevels=c(mode(a$levels[a$cat==1]),
                       mode(a$levels[a$cat==2]),
                       mode(a$levels[a$cat==3])))
c <- ddply(a, .(cat), summarise,
           mlevels=mode(levels))
naund

2 Answers

When you use summarise, plyr seems to "not see" the function declared in the global environment: the lookup finds the mode function in base first.

We can check this using Hadley's handy pryr package. You can install it with these commands:

library(devtools)
install_github("hadley/pryr")  # current devtools needs the "user/repo" form


require(pryr)
require(plyr)
c <- ddply(a, .(cat), summarise, print(where("mode")))
# <environment: namespace:base>
# <environment: namespace:base>
# <environment: namespace:base>

Basically, it doesn't read/know/see your mode function. There are two alternatives. The first is what @AnandaMahto suggested; I'd do the same and would advise you to stick with it. The other alternative is to not use summarise and instead pass an anonymous function, function(x), so that the mode function in your global environment is "seen".

c <- ddply(a, .(cat), function(x) mode(x$levels))
#   cat V1
# 1   1  6
# 2   2  5
# 3   3  9

Why does this work?

c <- ddply(a, .(cat), function(x) print(where("mode")))
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>

Because, as you can see above, it finds your function, which sits in the global environment.

> mode # your function
# function(x)
#     names(table(x))[which.max(table(x))]
> environment(mode) # where it sits
# <environment: R_GlobalEnv>

as opposed to:

> base::mode # base's mode function
# function (x) 
# {
#     some lines of code to compute mode
# }
# <bytecode: 0x7fa2f2bff878>
# <environment: namespace:base>

Here's an awesome wiki on environments from Hadley if you're interested in reading and exploring further.
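The lookup order described above can be reproduced without plyr. This is a minimal sketch, not plyr's actual code: f_global and f_ns are hypothetical helper names, and the stats namespace is used only as a stand-in for a package namespace, on the assumption that code running inside plyr resolves names the way any function defined in a package does (namespace, then imports, then base, and only then the global environment).

```r
# Your own mode(), defined outside any package (as in the question).
mode <- function(x) names(which.max(table(x)))

# Hypothetical helper whose enclosing environment is where it was defined,
# so the user-defined mode() is found first.
f_global <- function(x) mode(x)

# Same body, but the enclosing environment is swapped for a package namespace
# (stats is only a stand-in here). Lookup reaches base before the global env.
f_ns <- function(x) mode(x)
environment(f_ns) <- asNamespace("stats")

x <- factor(c("a", "b", "b"))
f_global(x)  # "b"        (user-defined mode)
f_ns(x)      # "numeric"  (base::mode, the storage mode of a factor)
```

The second result is the same "numeric" that shows up in c$mlevels in the question.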

Arun
  • +1, This is a *much* better answer than mine. I was too lazy to explore where plyr was looking for "mode" so I just renamed and simplified it. – A5C1D2H2I1M1N2O1R2T1 Mar 02 '13 at 10:48
  • Thank you for the answer, it is working now. To summarize: mode already exists in the global environment, and therefore plyr uses that one instead of the mode function provided by me. – naund Mar 02 '13 at 11:20
  • @user2126360, `mode` exists in `base`. – A5C1D2H2I1M1N2O1R2T1 Mar 02 '13 at 11:22
  • Nope. Your function exists in the global environment, but `summarise` "sees" a different `mode`, the default one in `base`. That's why @AnandaMahto explained in his edit that it is not wise to reuse names already used in R, as there are more chances for your code to break. – Arun Mar 02 '13 at 11:23
  • See also the `here` function: `ddply(a, .(cat), here(summarise), mlevels=mode(levels))`. The explanation in the docs should help. – hadley Mar 02 '13 at 14:35

You have pretty much exclusively used existing function names in your example: levels, cat, and mode. Generally, that doesn't create much of a problem--for example, calling a data.frame "df" doesn't break R's df() function. But it almost always leads to more ambiguous or confusing code, and in this case, it made things "break". Arun's answer does a great job of showing why.

You can easily fix your problem by renaming your "mode" function. In the example below, I've simplified it a little bit in addition to renaming it, and it works as you expected.

Mode <- function(x) names(which.max(table(x)))
ddply(a, .(cat), summarise,
      mlevels=Mode(levels))
#   cat mlevels
# 1   1       6
# 2   2       5
# 3   3       9
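As a quick sanity check that the simplification does not change behaviour, the two definitions can be compared on a small factor. This is only a sketch: mode_original is a hypothetical name for the question's function (renamed so it does not clash with base), and the sample vector is made up for illustration.

```r
# The question's definition, under a hypothetical non-clashing name.
mode_original <- function(x) names(table(x))[which.max(table(x))]

# The simplified definition from this answer.
Mode <- function(x) names(which.max(table(x)))

# Made-up sample: level "3" occurs three times, "1" twice, "2" once.
x <- factor(c(3, 1, 3, 2, 3, 1))

mode_original(x)                       # "3"
Mode(x)                                # "3"
identical(mode_original(x), Mode(x))   # TRUE
```

Both versions return the level name as a character string, and on ties both return the first level with the maximum count, since which.max picks the first maximum.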

Of course, there's a really cumbersome workaround: Use get and specify where to search for the function.

> mode <- function(x) names(table(x))[which.max(table(x))]
> ddply(a, .(cat), summarise, mlevels = get("mode", ".GlobalEnv")(levels))
  cat mlevels
1   1       6
2   2       5
3   3       9
A5C1D2H2I1M1N2O1R2T1