2

Using R, is there a way to take a data set and map out every possible combination of every categorical variable?

For example, let's say I had 10,000 rows of customer data from an online shop. I want to find out which shoppers spend the most money. I have the following data from the customer:

E-mail (yahoo, gmail, hotmail, aol) Browser (Mozilla, IE, Chrome, Opera) Country (USA, Canada, China, Australia, Egypt, S.Korea, Brazil) How much total money each user spent at my store.

I want to graph out every possible combination of the above three categorical variables and see what was the average spending of each customer.

What is a good way to do this in R? (note: my R experience is average right now)

Thanks.

Micro
  • 10,303
  • 14
  • 82
  • 120

4 Answers4

2

You can do this with aggregate:

set.seed(144)
dat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                 browser=sample(c("mozilla", "ie"), 10000, replace=T),
                 country=sample(c("usa", "canada"), 10000, replace=T),
                 money=runif(10000))
aggregate(dat$money, by=list(browser=dat$browser, email=dat$email,
                             country=dat$country), mean)
#   browser email country         x
# 1      ie gmail  canada 0.4905588
# 2 mozilla gmail  canada 0.5064342
# 3      ie yahoo  canada 0.4894398
# 4 mozilla yahoo  canada 0.4959031
# 5      ie gmail     usa 0.5069363
# 6 mozilla gmail     usa 0.5088138
# 7      ie yahoo     usa 0.4957478
# 8 mozilla yahoo     usa 0.4993698

To get multiple columns like mean and count together, you can do:

res = aggregate(dat$money, by=list(browser=dat$browser, email=dat$email,
                                   country=dat$country),
                FUN=function(x) c(mean=mean(x), count=length(x)))
res
#   browser email country       x.mean      x.count
# 1      ie gmail  canada    0.4905588 1261.0000000
# 2 mozilla gmail  canada    0.5064342 1227.0000000
# 3      ie yahoo  canada    0.4894398 1267.0000000
# 4 mozilla yahoo  canada    0.4959031 1253.0000000
# 5      ie gmail     usa    0.5069363 1240.0000000
# 6 mozilla gmail     usa    0.5088138 1236.0000000
# 7      ie yahoo     usa    0.4957478 1213.0000000
# 8 mozilla yahoo     usa    0.4993698 1303.0000000
josliber
  • 43,891
  • 12
  • 98
  • 133
  • +1 I'm not sure if the OP cares...but can this deal with empty combos? – Frank Apr 11 '14 at 19:23
  • @Frank no, this will drop empty combos. – josliber Apr 11 '14 at 19:28
  • This works pretty well. How would I go about putting a column in that output list that shows how many instances there are of that particular combination? of browser+email+country? – Micro Apr 14 '14 at 16:21
  • 1
    @frank Yeah, empty combinations don't really matter since they don't exist in the first place. – Micro Apr 14 '14 at 16:25
  • @user244034 I updated it with an example of how to aggregate with two output variables (I showed mean and count). – josliber Apr 14 '14 at 16:36
  • @josilber Perhaps you are missing something from the equation? It is still only displaying the mean, and no count. – Micro Apr 14 '14 at 17:54
  • @josilber I saw elsewhere that your syntax is correct, but for some reason it isn't working for me. It only outputs the mean column. Maybe it is my R version or something (newest one). – Micro Apr 14 '14 at 19:01
  • @user244034 we probably can't help you unless you provide a reproducible example where this code doesn't work. As you can see by running the code segment I've posted, it works with the synthetic dataset I've constructed. – josliber Apr 14 '14 at 19:40
  • @josilber Thanks anyways. I think there is something else going on with my R software, since I copied you exactly. Thanks though, your response was the clearest for me to understand and you helped a great deal =) – Micro Apr 14 '14 at 20:58
  • @Josilber Ok. I figured out what was going on. I was trying to store the results of the aggregate function into a variable. When I did that, it made a data.frame that only stored the "x" or mean column, followed by the x.count output in the same column. So the question is, how can I save all the results of that aggregate statement into a data.frame neatly? – Micro Apr 15 '14 at 16:31
  • @user244034 I updated the example to show how to save in a data frame called `res`. I'm not sure I understand what the confusion is here -- you just create a new variable. – josliber Apr 15 '14 at 17:28
1

That there is:

expand.grid(gender = c("male", "female"), tShirtSize = c("xs", "s","m","l","xl"))

Will return all the combinations in a dataframe. For the summary stats, try aggregate, e.g:

country = sample(c("america", "canadian"), 30, replace = TRUE)
gender = sample(c("male", "female"), 30, replace = TRUE)
x = abs(rnorm(30) * 1000)
aggregate(data.frame(x), by = list(country, gender), FUN = mean)

I run into errors if there are columns with strings in the dataframe, so I'd subset out the columns with numeric values.

Sean Easter
  • 859
  • 8
  • 16
1

Here's a method that utilizes dplyr

require(magrittr)
require(dplyr)    

set.seed(123)
dat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                 browser=sample(c("mozilla", "ie"), 10000, replace=T),
                 country=sample(c("usa", "canada"), 10000, replace=T),
                 money=runif(10000))  
dat %>%
  group_by(email, browser, country) %>%
  summarize(mean = mean(money))
# email browser country      mean
# 1 gmail      ie  canada 0.5172424
# 2 gmail      ie     usa 0.4921908
# 3 gmail mozilla  canada 0.4934892
# 4 gmail mozilla     usa 0.4993923
# 5 yahoo      ie  canada 0.5013214
# 6 yahoo      ie     usa 0.5098280
# 7 yahoo mozilla  canada 0.4985357
# 8 yahoo mozilla     usa 0.4919743

EDIT: if you want to pass a list into group_by(), you'll need to use the not-non-standard evaluation counterpart, regroup(). For example,

mylist <- list("email", "browser", "country")
dat %>%
  regroup(mylist) %>%
  summarize(mean = mean(money))

also see dplyr: How to use group_by inside a function?

Community
  • 1
  • 1
kevinykuo
  • 4,600
  • 5
  • 23
  • 31
  • What does the %>% mean? – Micro Apr 14 '14 at 16:29
  • `%>%` is the "pipe" operator... basically, the output on one line gets passed as the input to the function in the next. see http://stackoverflow.com/questions/8896820/how-to-implement-fs-forward-pipe-operator-in-r – kevinykuo Apr 14 '14 at 16:37
  • Is there some way to pass in an array (or list) of column names into the `group_by()` function? I would like to make a function where the above is dynamic, so one could set the columns he/she wants in a list or array and then pass it into the `group_by()` function as a parameter. I tried various things but `group_by()` didn't want to take them. – Micro Apr 17 '14 at 21:02
  • see edited answer. i believe the package author might have something in the pipeline that gives an even more elegant solution. https://github.com/hadley/dplyr/issues/352 – kevinykuo Apr 17 '14 at 21:15
  • Using your edit I searched and found another useful link explaining the above problem that I added to the edit. – Micro Apr 18 '14 at 13:56
0

Create some dummy data:

dataset <- data.frame(
    spend=10*runif(100),
    email=sample(c("yahoo","gmail","hotmail","aol"),100,replace=TRUE),
    browser=sample(c("Mozilla","IE","Chrome","Opera"),100,replace=TRUE),
    country=sample(c("USA","Canada","China","Australia",
      "Egypt","S.Korea","Brazil"),100,replace=TRUE))

Average the spend per combination:

with(dataset,by(spend,list(email,browser,country),mean))

Note the NAs for combinations without entries.

Or turn this into a three-dimensional array:

as.table(with(dataset,by(spend,list(email,browser,country),mean)))
Stephan Kolassa
  • 7,953
  • 2
  • 28
  • 48