I want to compute some statistics in R on a data set that I have. The data is in a list and is grouped by an identifying code, given here in the cat column:

cat         AS_Year AS_Day  As_Month    EVENT_TYPE  RESULT_TYPE REASON_TYPE OPERATOR_TYPE   DATE_EVENT  Day_Total
9002F100AS2 2009    14       2          9002        F           100         AS2             14-Feb-09   2
9002F123AS2 2009    14       2          9002        F           123         AS2             14-Feb-09   1
9008F0AS2   2009    14       2          9008        F           0           AS2             14-Feb-09   1

There are thousands of these codes on each day and I would like to do some statistics on the volumes for each.

I have looked into things and have tried playing around with

ddply(dtest,~group,summarise,mean=mean(Day_Total),sd=sd(Day_Total))

This gives me NA for the mean and a standard deviation that doesn't match the one I get from Excel. I have also tested this on a simpler, smaller test data set, and the means and standard deviations still don't seem to be correct. Does anyone have any advice on how to use this, or am I missing something?

Taylrl
  • Have you looked at [this SO thread](http://stackoverflow.com/a/25198553/1305688)? – Eric Fail Aug 11 '14 at 12:58
  • What if you replace `mean(Day_Total)` by `mean(Day_Total, na.rm=TRUE)`? – lukeA Aug 11 '14 at 12:59
  • Try to avoid `plyr` as it is slow. Try for example, `library(data.table); setDT(dtest)[, list(mean = mean(Day_Total, na.rm = T), sd=sd(Day_Total, na.rm = T)), by = cat]` – David Arenburg Aug 11 '14 at 12:59
  • @David That did the trick! Do you have any further explanation of this and how I can take it further? For example, I would then like to look at one 'cat' code only and analyse just that one. Can I pull one 'cat' code, or multiple 'cat' codes, out into its own data.table? – Taylrl Aug 13 '14 at 10:21
  • You can do anything you want. Just clarify your question and explain exactly what you are trying to do and what your desired output is – David Arenburg Aug 13 '14 at 10:32
  • Ok, I am actually working with a different data set now, so it might be worthwhile starting a new thread where I explain what I need to do there – Taylrl Aug 13 '14 at 10:55
  • Ok, post me a link here so I will be able to check it out. In the meantime I'll post my comment as an answer in order to close this – David Arenburg Aug 13 '14 at 11:45
  • Thanks for your help, you have been brilliant. Here is the link to the new thread http://stackoverflow.com/questions/25285788/grouping-data-by-time-stamp-and-then-activity-type-using-r – Taylrl Aug 13 '14 at 13:25
  • Try providing a reproducible example and the desired output (in your new question). As it stands now it is very unclear – David Arenburg Aug 13 '14 at 13:54

2 Answers

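Try the very efficient data.table package:

library(data.table)
# Convert dtest to a data.table by reference, then compute mean and sd per cat,
# ignoring missing values
setDT(dtest)[, list(mean = mean(Day_Total, na.rm = TRUE),
                    sd = sd(Day_Total, na.rm = TRUE)), by = cat]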

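Or, if you prefer to stick with the plyr family, try the newer and much more efficient dplyr package.

Note: detach plyr first with detach("package:plyr", unload = TRUE); otherwise plyr's summarise can mask dplyr's and the grouping is ignored.

library(dplyr)
# Group by cat, then compute mean and sd per group, ignoring missing values
dtest %>%
  group_by(cat) %>%
  summarise(mean = mean(Day_Total, na.rm = TRUE), sd = sd(Day_Total, na.rm = TRUE))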
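
If you want to cross-check either result without loading any package, the same per-group summary can be sketched in base R with tapply. The small data frame below is a made-up stand-in for dtest (only the column names cat and Day_Total come from the question), so treat it as an illustration rather than a replacement for the code above.

```r
# Hypothetical stand-in for dtest; the values are invented for illustration.
dtest <- data.frame(
  cat = c("A", "A", "A", "B", "B"),
  Day_Total = c(2, 4, NA, 1, 3)
)

# Per-group mean and sd; na.rm = TRUE is passed through to mean() and sd(),
# so missing values are dropped within each group.
group_mean <- tapply(dtest$Day_Total, dtest$cat, mean, na.rm = TRUE)
group_sd <- tapply(dtest$Day_Total, dtest$cat, sd, na.rm = TRUE)

# Collect the results into one data frame, one row per cat code.
result <- data.frame(
  cat = names(group_mean),
  mean = as.vector(group_mean),
  sd = as.vector(group_sd)
)
print(result)
```

When comparing against Excel, keep in mind that R's sd() is the sample standard deviation (n - 1 denominator), which corresponds to Excel's STDEV / STDEV.S rather than STDEVP / STDEV.P.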
David Arenburg

I assume that by group you meant cat in your one-liner. Could it be that Day_Total or cat is not of the right type? Could there be some NA (missing) values in the Day_Total column?
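
Both possibilities can be checked quickly in base R. This is a minimal sketch on a made-up stand-in for dtest (only the column names are taken from the question); run the same checks on your real data.

```r
# Hypothetical stand-in for dtest; the values are invented for illustration.
dtest <- data.frame(
  cat = c("9002F100AS2", "9002F100AS2", "9008F0AS2", "9008F0AS2"),
  Day_Total = c(2, NA, 1, 3),
  stringsAsFactors = FALSE
)

# Inspect column types: Day_Total should be numeric, cat character or factor.
str(dtest)
day_total_is_numeric <- is.numeric(dtest$Day_Total)

# Count missing values; a single NA is enough to make mean() and sd() return NA.
n_missing <- sum(is.na(dtest$Day_Total))

# Compare the default behaviour with na.rm = TRUE.
mean_with_na <- mean(dtest$Day_Total)
mean_without_na <- mean(dtest$Day_Total, na.rm = TRUE)
```

If Day_Total turns out to be a factor or character column (for example after a CSV import), it needs converting to numeric before mean() and sd() will give meaningful results.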

What does the following give?

ddply(dtest, .(as.factor(cat)), summarise, mean = mean(Day_Total, na.rm = TRUE), sd = sd(Day_Total, na.rm = TRUE))