1

The dataset consists of a company identifier (a string), three columns of data values and a date (yyyymmdd). There are around 25,500 unique company identifiers and each has daily values from Jan. 1 1973 to the present for a total of around 700,000 rows. What I would like to do is calculate some statistics (i.e. range, mean, median, SD, etc) for each date in the dataset. The data was originally a csv and was imported into R as a dataframe. My first attempt is below but I was wondering if there is a more efficient method than looping through 25,000 rows

stat <- data.frame(Date=as.Date(character()), Mean=numeric(), SD=numeric(), Quant_75=numeric(), Quant_25=numeric(), Range=numeric(), stringsAsFactors=FALSE) 
uniq <- unique(unlist(data$Date))
for (i in 1:length(uniq)){
data_sub <- subset(data, date == uniq[i])
stat[i,] = rbind(date, mean(data_sub), sd(data_sub), quantile(data_sub, 0.75), quantile(data_sub,0.25), range(data_sub) )
}
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
user6893
  • 143
  • 1
  • 2
  • 9

1 Answers1

0

There is a plethora of options for doing this in R: a couple of them native to R, others like plyr and sqldf have their own package. I would recommed the data.table package because of its speed on large data sets. On your data something like:

require(data.table)
data = data.table(data)
data[,list(AVG_BY_DAY:=mean(COLNAME1,na.rm=T),SD_BY_DAY:=sd(COLNAME2,na.rm=T,Q_75_BY_DAY=quantile(COLNAME3,0.75)) ),by="Date"]
data[,list(AVG_BY_COMPANY:=mean(COLNAME1,na.rm=T),SD_BY_COMPANY:=sd(COLNAME2,na.rm=T,Q_75_BY_COMPANY=quantile(COLNAME3,0.75)) ),by="Company"]

Check out this link, it could be useful to anybody handling data in R.

Community
  • 1
  • 1
crogg01
  • 2,446
  • 15
  • 35
  • Actually I am getting the following error. Error in `:=`(AVG_BY_DAY, mean(technical$PRC, na.rm = T)) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":="). – user6893 Apr 24 '14 at 21:48
  • 1
    you don't need to specify the data.table name: so "`:=`(AVG,mean(PRC,na.rm=T))" should suffice. Also make sure PRC is numeric and not a factor + don't forget the backticks around `:=` (invisible in the comments because it is used for quoting code on stackoverflow) if you use the notation `:=`(AVG=mean(PRC,na.rm=T),SD=sd(PRC,na.rm=T)) instead of `list(AVG:=mean(...),SD:=sd(...)` – crogg01 Apr 24 '14 at 22:02