Is there a quick way to get the mean of various "bins" of a column in a dataframe?

Question

I apologize if this has been answered - I just can't find it! To simplify, I have a dataframe of cars with 2 pertinent columns: mileage and price. I want to calculate the mean price and number of cars for 0-20,000 miles, 20,000-40,000, and so on (in 20,000 mile "bins"). I have been making subsets of data for the various mileage ranges and then looking at the mean and number or vehicles for that subset. I'm wondering if there is a more efficient way to do this, instead of making all of these subsets - I'm doing it many times over with various "bins" and data. I would love to learn a slicker way of doing this.

Thanks!!

Please make your post reproducible by having a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for us to help you. Thank you. — Arun, Apr 16 '13 at 21:35

score 2 · Answer 1 · answered Apr 16 '13 at 21:45

2

You probably want smth along these lines:

library(data.table)
d = data.table(mileage = runif(1000, 0, 100000), price = runif(1000, 15000, 35000))

d[, list(price = mean(price), number = .N),
    by = cut(mileage, c(0, 20000, 25000, 30000, 100000))][order(cut)]
#                cut    price number
# 1:       (0,2e+04] 25252.70    215
# 2: (2e+04,2.5e+04] 25497.66     46
# 3: (2.5e+04,3e+04] 25349.79     45
# 4:   (3e+04,1e+05] 25037.93    694

answered Apr 16 '13 at 21:45

eddi

49,088
6
104
155

`cut` makes much more sense here! (+1). – Arun Apr 16 '13 at 21:53
Thank you! This is perfect! I love expanding my R knowledge instead of brute-forcing it out. I appreciate you taking the time. – user2012544 Apr 17 '13 at 19:41

IRTFM · Answer 2 · 2013-04-16T22:09:10.180

1

This shows how to use aggregate to return more than one statistic by category in a single run.

 # Using Quentin's data
 d[['mileage.cat']] <- cut(d$mileage, breaks=seq(0, 200000, by= 20000))
 aggregate(d$price, d['mileage.cat'] , 
                 FUN=function(price) c(counts=length(price),
                                         mean.price=mean(price) ) )

    mileage.cat x.counts x.mean.price
1     (0,2e+04]   212.00     24859.01
2 (2e+04,4e+04]   194.00     24343.16
3 (4e+04,6e+04]   196.00     24357.73
4 (6e+04,8e+04]   191.00     25006.71
5 (8e+04,1e+05]   207.00     25250.23

edited Apr 16 '13 at 22:09

answered Apr 16 '13 at 21:57

IRTFM

258,963
21
364
487

If you want both length and mean at the same time, this is an improvement compared to my answer :). – Quentin Geissmann Apr 16 '13 at 22:02
If you wanted to use `aggregate.formula` then your function could still be `function(x) c(mean=mean(x), length=length(x) )` – IRTFM Apr 16 '13 at 22:13

score 0 · Answer 3 · answered Apr 16 '13 at 21:53

To make the bins, "cut". Example:

x=1:10
bkpt=c(0,2.5,7.5,10)
x.cut=cut(x,breaks=bkpt)

which finishes up like this. y is some data for later:

y=21:30
data.frame(x,x.cut,y)

To calculate something for each group, use tapply. Following my example:

tapply(y,x.cut,length)
tapply(y,x.cut,mean)

which calculates (a) the number of y's and (b) the mean of y's in each of the groups defined by x.cut.

Quentin Geissmann · Answer 4 · 2013-04-16T22:03:54.517

0

Another approach using aggregate:

df <- data.frame(mil = sample(1e5,20),price = sample(1000,20) )
#mil2 is our "mile bin" ( 0 -> [0:20000[; 1 -> [20000:40000[ ...)
df$mil2 = trunc(df$mil /20000)

# then to get the mean by "mile bin":
aggregate(price ~ mil2,df,mean)
# or the number:
aggregate(price ~ mil2,df,length)
# or simply:
table(df$mil2)

edited Apr 16 '13 at 22:03

answered Apr 16 '13 at 21:54

Quentin Geissmann

2,240
1
21
36

Is there a quick way to get the mean of various "bins" of a column in a dataframe?

4 Answers4