Separate huge dataset into bins and average them out in R

Question

I'm trying to develop a program to allow visualization of big data in graphs. Basically, the idea is that I can input a huge dataset and output a line graph in which I can actually see the trends.

Please let me know if there are already algorithms like this built into R or in a package, as I realize this is a very basic or 'primitive' way of aggregating data. I also don't want to use sample(), because I am specifically looking for trends in the data. I realize that there will always be a trade-off between accuracy of data and ease of data representation in this case. Here is my idea:

Let's say I have a standard CSV dataset of 10,000 numeric rows (columns representing variables). I want to create a resultant dataset that splits this huge dataset into 20-30 bins, each bin being a single data point that is the average of a certain number of data points in the big dataset. For example, if I had 10 bins, each bin would be the average of 1,000 data points.

Here is my code:

average <- function(dataf)
{
  # keep only the numeric columns
  numericData <- dataf[, sapply(dataf, is.numeric)]
  # *** see the note below the code
  mean(numericData, trim = 0, na.rm = TRUE)
}
x <- names(numericData)
real <- ddply(diamonds, .(x), average)

***I do not know what to do here. This is where I want to separate numericData into a certain number of bins and average the data within each bin.

On another important note, most of the datasets I input will have time variables (this is why I mentioned a line graph). The mean() function only works on numeric data, so how could I average out a time column? By averaging out, I mean that if the time column is in YYYY-MM-DD format, I could aggregate the days and graph the data by month (YYYY-MM). If that is possible, then I would not even have to worry about averaging the other columns!

How can I do this?

Thanks for any input, and sorry for the long post, I felt like I needed to provide all the necessary information.

check out the [bigvis](https://github.com/hadley/bigvis) package. — haki, Aug 11 '13 at 09:54
I think converting your 'Time variable' to `Date` using `as.Date` may be a good start. Then have a look at `rollmean` in `zoo` package. Most importantly though: please provide a small, reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Henrik, Aug 11 '13 at 11:01
In my program, the user can choose which columns to choose for his axes. He can either choose a Date column or a numeric column. Is there a way for me to check whether he chooses a Date column? (then I can apply the as.Date() function). — jeffrey, Aug 11 '13 at 17:46
I suggest you convert everything that is a date to `Date` already in your original data set. `Date`s are very convenient to work with and there are loads of functions (plotting, aggregations, arithmetic, etc., presumably relevant for you), in base and other packages, that take `Date`s as input. You write that you felt like you needed to provide all the necessary information. However, dummy data, a script of what you have tried on the dummy data, and the expected results are still lacking. You will receive much more help with a minimal, reproducible example. — Henrik, Aug 11 '13 at 20:05
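On the question of checking whether the chosen column is a date: a minimal sketch in base R, assuming the column is either already a `Date`/`POSIXct` object or a character column in YYYY-MM-DD format. The helper name `is_date_col` and the example data frame are made up for illustration.

```r
# Hypothetical helper: TRUE if x is a Date or a date-time (POSIXct/POSIXlt)
is_date_col <- function(x) inherits(x, c("Date", "POSIXt"))

# Illustrative data: one real Date column, one numeric, one date stored as text
df <- data.frame(when  = as.Date("2013-08-01") + 0:2,
                 value = c(1.5, 2.5, 3.5),
                 stamp = c("2013-08-01", "2013-08-02", "2013-08-03"),
                 stringsAsFactors = FALSE)

# Check every column at once; only `when` is a Date at this point
before <- sapply(df, is_date_col)

# A text column has to be converted explicitly before it is treated as a date
df$stamp <- as.Date(df$stamp, format = "%Y-%m-%d")
after <- sapply(df, is_date_col)
```

Note that a date stored as text is not detected until it has been converted, which is one reason to convert up front when the data is read in.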

SlowLearner · Accepted Answer · 2013-08-12T08:37:28.727


Sounds like a simple enough job for ddply, which you already reference in your question?

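require(lubridate)  # month(), year()
require(plyr)       # ddply()

# Dummy data: 3000 consecutive days with random values
mylen <- 3000
mydf <- data.frame(mydate = seq(as.Date('2000-01-01'), length.out = mylen, by = 'day'),
                   value = runif(mylen, 10, 10000))

# Extract the grouping keys from the date
mydf$month <- month(mydf$mydate)
mydf$year <- year(mydf$mydate)

# One row per (year, month) holding the mean of value
newdf <- ddply(mydf, .(year, month), summarise, my.mean = mean(value))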

Output looks like this:

> tail(newdf)
   year month  my.mean
94 2007    10 5103.671
95 2007    11 5034.605
96 2007    12 5534.769
97 2008     1 4437.816
98 2008     2 4717.377
99 2008     3 5862.858
>
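The monthly grouping above covers the dated case. For the equal-sized bins described in the question (for example, 10 bins of 1,000 rows each), a minimal base R sketch might look like this. The names `bin_means` and `n_bins` are made up for illustration, and the sketch assumes the rows are already in the order you want to plot.

```r
# Hypothetical helper: average the numeric columns of a data frame
# within n_bins consecutive groups of rows
bin_means <- function(dataf, n_bins = 20) {
  # Keep only numeric columns; drop = FALSE keeps a data frame even for one column
  numeric_data <- dataf[, sapply(dataf, is.numeric), drop = FALSE]
  # Assign each row index to one of n_bins consecutive, (nearly) equal-sized bins
  bin <- cut(seq_len(nrow(numeric_data)), breaks = n_bins, labels = FALSE)
  # Average every numeric column within each bin
  aggregate(numeric_data, by = list(bin = bin), FUN = mean, na.rm = TRUE)
}

# Illustrative data: 10,000 rows reduced to 10 averaged points
big <- data.frame(x = 1:10000, y = sqrt(1:10000))
binned <- bin_means(big, n_bins = 10)
```

The result has one row per bin, so `plot(binned$x, binned$y, type = "l")` draws the reduced line graph.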

edited Aug 12 '13 at 08:37

answered Aug 11 '13 at 10:41

SlowLearner


Why do you have `require (ddply)`? Isn't `ddply` included in `plyr`? – Metrics Aug 11 '13 at 11:06

Apologies, you are quite right, that was a typo on my part, now corrected. For some reason I had got it into my head as I was writing that answer that `ddply` was a separate package. Duh! – SlowLearner Aug 12 '13 at 09:43
