I'm trying to develop a program that lets me visualize big datasets as graphs. Basically, the idea is that I can input a huge dataset and output a line graph in which I can actually see the trends.
Here is my idea. (Please let me know if there are already algorithms like this built into R or available in a package; I realize this is a very basic, 'primitive' way of aggregating data. I also don't want to use sample(), because I am specifically looking for trends in the data, and I accept that there is always going to be a trade-off between accuracy and ease of representation in this case.)
Let's say I have a standard CSV dataset of 10,000 numeric rows (with columns representing variables). I want to create a resultant dataset that takes this huge dataset and separates it into 20-30 bins, each bin representing one data point that is the average of a certain number of data points in the big dataset. For example, with 10 bins, each bin would be the average of 1,000 data points.
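On a single numeric vector, the idea would look roughly like this (a minimal sketch; x and nbins are hypothetical, purely to illustrate the binning):

x <- rnorm(10000)                                          # stand-in for one column of real data
nbins <- 10
bin <- cut(seq_along(x), breaks = nbins, labels = FALSE)   # consecutive, roughly equal bins
binned <- tapply(x, bin, mean, na.rm = TRUE)               # one averaged point per bin
plot(binned, type = "l")                                   # 10 points instead of 10,000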
Here is my code:
library(plyr)
library(ggplot2)  # for the diamonds example dataset

average <- function(dataf)
{
  numericdata <- dataf[, sapply(dataf, is.numeric)]  # keep only the numeric columns
  colMeans(numericdata, na.rm = TRUE)                # mean() does not accept data frames
}
x <- names(diamonds)[sapply(diamonds, is.numeric)]  # numericdata only exists inside average()
real <- ddply(diamonds, .(x), average)              # wrong: .(x) splits by a column literally named "x"
This is where I'm stuck. I want to separate numericdata into a certain number of bins here, averaging the data within each bin, but I don't know what to hand ddply as the splitting variable.
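To make the question concrete, here is the kind of thing I imagine (an untested sketch; the bin column and plyr::numcolwise are my guesses at an approach, not code I know to be right):

nbins <- 20
numericdata$bin <- cut(seq_len(nrow(numericdata)), breaks = nbins, labels = FALSE)
real <- ddply(numericdata, .(bin), numcolwise(mean, na.rm = TRUE))  # one averaged row per bin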
On another important note, most of the datasets I input will have time variables (this is why I mentioned a line graph). mean() only works on numeric data, so how could I average out a time column? By 'averaging out' I mean that if the time column is in YYYY-MM-DD format, I could aggregate the days and graph the data by month (YYYY-MM). If this is the case, then I would not even have to worry about averaging the other columns!
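For the time column, I was picturing something like this (a hedged sketch; df, date, and value are hypothetical names, and I am assuming the same plyr approach as above):

df$month <- format(as.Date(df$date), "%Y-%m")                 # collapse YYYY-MM-DD to YYYY-MM
monthly <- ddply(df, .(month), numcolwise(mean, na.rm = TRUE))
plot(as.Date(paste0(monthly$month, "-01")), monthly$value,    # "value" stands in for a data column
     type = "l")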
How can I do this?
Thanks for any input, and sorry for the long post; I felt I needed to provide all the necessary information.