0

I am dealing with a dataframe in R that has only 2 columns but a very large number of rows. I want to divide my dataframe into subsets of say 'm' rows each and find the mean of the values in a column for each of these m rows for each division of the dataframe and then return these mean values for all the divisions containing 'm' rows each.

Say my data frame is y with columns 'a' and 'b' and i want 'm' as 1000 in this case.

I want to find mean(y[i:i+999,2])

I would want to take the value of i over all the rows and return the mean values for, in this case each block of 1000 values in column 'b'

i=1
add=function(i,999){i=i+999}
z=return(i)
p=mean(y[z,2])

I think I am doing it wrong. Any insights ?

Anurag Mishra
  • 1,007
  • 6
  • 16
  • 23

3 Answers3

2

The zoo package has rollapply which is really useful for applying a rolling function like this. You can use sapply to loop over the columns of a data.frame and apply the rollapply function (sapply loops over the elements of a list and a dataframe is actually a collection of lists).

Hopefully this example makes sense...

require(zoo)
## Sample data, two columns one million rows
df <- data.frame( A = runif(1e6) , B = runif(1e6) )

## Set desried 'chunk' size, i.e. the
## number of rows to find the mean of
## at once. Let's do 1e4, so we will 
## get 100 values back (1e6/1e4=1e2)
m = 1e4

## use sapply to loop across the columns, and
## apply rollapply to each column, which takes
## the mean of each set of 10,000 values
dfMean <- sapply( df , function(x) rollapply( x , width = m , by = m , align = "left" , FUN = mean ) )

nrow(dfMean)
#[1] 100

head(dfMean)
#            A         B
#[1,] 0.4966775 0.4992207
#[2,] 0.5013934 0.4986489
#[3,] 0.4994544 0.5009876
#[4,] 0.5020374 0.4979467
#[5,] 0.5049408 0.4999280
#[6,] 0.4969987 0.5018564
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
1

Using data.table will be your fastest option, and I think the by= syntax for "grouping by" is very intuitive.

library(data.table)

# Sample data:
dt<-data.table(A=runif(1e6L), B=runif(1e6L), key="B")

# Note that keying by column B will order the rows by B;
# You can leave out the key if you don't care about the order
# or have already set it

# Average every 1000 records in column B:
dt[,list(avg=mean(B)),by=rep(1L:nrow(dt),each=1000L,length.out=nrow(dt))]

This took about a tenth of a second compared to over 48 seconds for the rollaply solution.

dnlbrky
  • 9,396
  • 2
  • 51
  • 64
0

If I understand correctly what you're asking, you need a Moving Average on your column b

y<-data.frame(a=runif(2000),b=runif(2000))

m=1000
means=NULL;p=NULL
for(i in 1:(nrow(y)-m)){
  p=c(p,mean(y[i:(i+999),2]))
}

plot(p)

Moving Average

Romain
  • 6,322
  • 3
  • 35
  • 40
  • 1
    If they want the moving average, they should use `filter`. If you want to use a loop for this, at least pre-allocate `p`. – Roland May 11 '13 at 10:28
  • `p<-NULL` works fine and I'm not sure whether OP wants a MA or just a mean on 1k-size blocks. Moreover `filter()` is in the `stats` package so loading a package just for that... – Romain May 11 '13 at 11:55
  • 1
    You're in the [Second Circle of Hell](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). Package stats is loaded by default. – Roland May 11 '13 at 12:52
  • for(i in 1:(nrow(y)-m)){p=c(p,mean(y[i:(i+999),2]))} @Romain This code gives error : In mean.default(y[i:(i + 999), 2]) : argument is not numeric or logical: returning NA – Anurag Mishra May 11 '13 at 13:51
  • sorry, i got it. there was some other mistake in my code :) thanks a lot – Anurag Mishra May 11 '13 at 14:03
  • @AnuragMishra if this solved your problem, then ideally you would accept it as an answer and press the green check mark to mark the question as answered and help to keep the site clean. However, you are under **no** obligation to do so. Thanks! – Simon O'Hanlon May 11 '13 at 15:52