0

I apologize if this has been answered - I just can't find it! To simplify, I have a dataframe of cars with 2 pertinent columns: mileage and price. I want to calculate the mean price and number of cars for 0-20,000 miles, 20,000-40,000, and so on (in 20,000 mile "bins"). I have been making subsets of data for the various mileage ranges and then looking at the mean and number or vehicles for that subset. I'm wondering if there is a more efficient way to do this, instead of making all of these subsets - I'm doing it many times over with various "bins" and data. I would love to learn a slicker way of doing this.

Thanks!!

  • Please make your post reproducible by having a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for us to help you. Thank you. – Arun Apr 16 '13 at 21:35

4 Answers4

2

You probably want smth along these lines:

library(data.table)
d = data.table(mileage = runif(1000, 0, 100000), price = runif(1000, 15000, 35000))

d[, list(price = mean(price), number = .N),
    by = cut(mileage, c(0, 20000, 25000, 30000, 100000))][order(cut)]
#                cut    price number
# 1:       (0,2e+04] 25252.70    215
# 2: (2e+04,2.5e+04] 25497.66     46
# 3: (2.5e+04,3e+04] 25349.79     45
# 4:   (3e+04,1e+05] 25037.93    694
eddi
  • 49,088
  • 6
  • 104
  • 155
1

This shows how to use aggregate to return more than one statistic by category in a single run.

 # Using Quentin's data
 d[['mileage.cat']] <- cut(d$mileage, breaks=seq(0, 200000, by= 20000))
 aggregate(d$price, d['mileage.cat'] , 
                 FUN=function(price) c(counts=length(price),
                                         mean.price=mean(price) ) )

    mileage.cat x.counts x.mean.price
1     (0,2e+04]   212.00     24859.01
2 (2e+04,4e+04]   194.00     24343.16
3 (4e+04,6e+04]   196.00     24357.73
4 (6e+04,8e+04]   191.00     25006.71
5 (8e+04,1e+05]   207.00     25250.23
IRTFM
  • 258,963
  • 21
  • 364
  • 487
0

To make the bins, "cut". Example:

x=1:10
bkpt=c(0,2.5,7.5,10)
x.cut=cut(x,breaks=bkpt)

which finishes up like this. y is some data for later:

y=21:30
data.frame(x,x.cut,y)

To calculate something for each group, use tapply. Following my example:

tapply(y,x.cut,length)
tapply(y,x.cut,mean)

which calculates (a) the number of y's and (b) the mean of y's in each of the groups defined by x.cut.

0

Another approach using aggregate:

df <- data.frame(mil = sample(1e5,20),price = sample(1000,20) )
#mil2 is our "mile bin" ( 0 -> [0:20000[; 1 -> [20000:40000[ ...)
df$mil2 = trunc(df$mil /20000)

# then to get the mean by "mile bin":
aggregate(price ~ mil2,df,mean)
# or the number:
aggregate(price ~ mil2,df,length)
# or simply:
table(df$mil2)
Quentin Geissmann
  • 2,240
  • 1
  • 21
  • 36