Need help getting summary statistics for R data frame

Question

This is my data (imagine I have 1050 rows of data shown below)

ID_one  ID_two parameterX
111      aaa     23
222      bbb     54
444      ccc     39

My code then will divide the rows into groups of 100 (there will be 10 groups of 100 rows).

I then want to get the summary statistics per group. (not working) After that I want to place the summary statistics in a data frame to plot them.

For example, put all 10 means for parameterX together in a data frame, put all 10 standard deviations for parameterX in the same data frame, etc. The following code is not working:

#assume data is available
dataframe_size <- nrow(thedata)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)

#splitdata into groups of 100
split_dataframe_into_groups <- function(x,y)
    0:(x-1) %% y 
list1 <- split(thedata, split_dataframe_into_groups(nrow(thedata), group_size))

 #print data in the first group
 list1[[1]]$parameterX

 #NOT WORKING!!!  #get summary stat for all 10 groups
 # how to loop through all 10 groups?
 list1_stat <- do.call(data.frame, list(mean = apply(list1[[1]]$parameterX, 2, mean),
     sd = apply(list1[[1]]$parameterX, 2, sd). . .))

The error message is always:

Error in apply(...) dim(x) must have a positive length

That makes NO sense because when I run this code, there is clearly a positive length (data exists):

 #print data in the first group
 list1[[1]]$parameterX

  #how to put all means in a dataframe?
  # how to put all standard deviations in the same dataframe
  ex  df1 <- mean(2,2,3,4,7,2,4,,9,8,9),
             sd (0.1, 3 , 0.5, . . .)

Does this work for your code: `t(sapply(list1, function(x) c(mean = mean(x$parameterX), sd = sd(x$parameterX))))`? — Raad, May 06 '16 at 13:07
Are you creating the groups based on row number or based on ID1 or ID2? — Raphael K, May 06 '16 at 13:21
It is based on row number. E.g. rows 1-100 will be in group 1, rows 101-200 will be in group 2, etc. — James Rodriguez, May 06 '16 at 13:22
Hi NBAtrends, I tried your code. The means printed as NA but I see standard deviation. Why would mean be NA while sd is a valid number? Something does not seem right — James Rodriguez, May 06 '16 at 14:08
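
A note on the error above: `apply()` expects an object with dimensions (a matrix or data frame), whereas `list1[[1]]$parameterX` is a plain vector whose `dim()` is `NULL`, hence the message. Separately, `%%` returns remainders, so it puts rows 1, 101, 201, ... in the same group; integer division (`%/%`) gives consecutive blocks instead. A minimal base-R sketch along the lines of Raad's comment, assuming `parameterX` is numeric and using made-up data in place of `thedata`:

```r
# Made-up stand-in for the asker's data (assumption): 1050 rows
set.seed(42)
thedata <- data.frame(parameterX = rnorm(1050, mean = 40, sd = 10))
group_size <- 100

# Integer division gives consecutive blocks: rows 1-100 -> 0, rows 101-200 -> 1, ...
group_id <- (seq_len(nrow(thedata)) - 1) %/% group_size
list1 <- split(thedata, group_id)

# sapply() walks over the list of groups, so no dim() is required
list1_stat <- data.frame(
  group = names(list1),
  n     = sapply(list1, nrow),
  mean  = sapply(list1, function(g) mean(g$parameterX)),
  sd    = sapply(list1, function(g) sd(g$parameterX)),
  row.names = NULL
)
list1_stat
```

With 1050 rows this yields 11 groups, the last one holding the remaining 50 rows.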

score 0 · Answer 1 · edited May 23 '17 at 12:31

I think this might be a good place to use `tapply`. There is an excellent summary here! One path forward might be an extension of the below:

df <- data.frame(id= c(rep("AA",10),rep("BB",10)),  x=runif(20))
do.call("rbind", tapply(df$x, df$id, summary))
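
To adapt this to groups defined by row position rather than by an ID column, one option is to build the grouping vector from the row numbers. The sketch below assumes made-up data named `thedata` with a numeric `parameterX` column and a block size of 100; the names `grp` and `stats` are illustrative only.

```r
set.seed(1)
thedata <- data.frame(parameterX = runif(1050))  # stand-in data (assumption)
group_size <- 100

# Block number for each row: rows 1-100 -> 1, rows 101-200 -> 2, ...
grp <- (seq_len(nrow(thedata)) - 1) %/% group_size + 1

# One row of summary() output per block
stats <- do.call("rbind", tapply(thedata$parameterX, grp, summary))

# summary() does not report a standard deviation, so add it as an extra column
stats <- cbind(stats, SD = as.vector(tapply(thedata$parameterX, grp, sd)))
stats <- as.data.frame(stats)
stats
```

Each row of `stats` then describes one block, which should be convenient for plotting the means or standard deviations against the block number.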

edited May 23 '17 at 12:31

Community

answered May 06 '16 at 13:27

greengrass62

Gaurav Taneja · Answer 2 · 2016-05-06T13:47:40.523

0

I think this is what you want:

library(dplyr)
dt <- rbind(iris, iris, iris)
dataframe_size <- nrow(dt)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
df <- dt %>%
  # Creating the "bins" column using mutate
  mutate(bins = cut(seq(1:dataframe_size), breaks = number_ofgroups)) %>%
  # Aggregating the summary statistics by the bins variable
  group_by(bins) %>%
  # Calculating the mean
  summarise(mean.Sepal.Length = mean(Sepal.Length))


head(dt)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

df

     bins mean.Sepal.Length
   (fctr)             (dbl)
1 (0.551,113]          5.597345
2   (113,226]          5.755357
3   (226,338]          5.919643
4   (338,450]          6.100885

edited May 06 '16 at 13:47

answered May 06 '16 at 13:38

Gaurav Taneja

Would you be able to clarify the answer? Thanks. What is this: `group_by(bins) %>% summarise(mean.Sepal.Length = mean( Sepal.Length))`? And `dt<-rbind(iris,iris,iris)`? And what is this: `df<-dt %>% mutate(bins=cut(seq(1:dataframe_size),breaks=number_ofgroups)) %>%`? – James Rodriguez May 06 '16 at 13:44
`rbind(iris,iris,iris)` is just to create a dataset with enough rows so that binning 100 rows makes sense. The approach uses dplyr, which is very similar to SQL in readability. You can learn more at: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html – Gaurav Taneja May 06 '16 at 13:49
It ran, but how can I use this code for my purposes? I need mean, standard deviation, etc. Should I use the mutate function? Thanks – James Rodriguez May 06 '16 at 14:06
You can use the `mean()`, `sd()`, etc. functions within `summarize()`. Take a look at this link: http://www.r-bloggers.com/using-r-quickly-calculating-summary-statistics-with-dplyr/ – Gaurav Taneja May 06 '16 at 14:23
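
As a sketch of that suggestion, several statistics can be computed in a single `summarise()` call. This reuses the answer's `dt` and `bins` setup; the output column names (`count`, `avg`, `std_dev`, and so on) are arbitrary choices, and it assumes dplyr is installed.

```r
library(dplyr)

dt <- rbind(iris, iris, iris)                 # 450 rows, as in the answer
dataframe_size <- nrow(dt)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)

stats <- dt %>%
  # Same equal-width bins over the row numbers as in the answer
  mutate(bins = cut(seq_len(dataframe_size), breaks = number_ofgroups)) %>%
  group_by(bins) %>%
  # One output column per statistic, one output row per bin
  summarise(
    count   = n(),
    avg     = mean(Sepal.Length),
    std_dev = sd(Sepal.Length),
    med     = median(Sepal.Length),
    minimum = min(Sepal.Length),
    maximum = max(Sepal.Length)
  )
stats
```

Note that `cut()` with a number of breaks makes equal-width bins over the row numbers, so the bins here hold roughly 113 rows each rather than exactly 100.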

Raphael K · Accepted Answer · 2016-05-06T15:01:57.203

0

dplyr is so good for this kind of thing. If you create a new column that assigns a 'group' ID based on row location, then you can summarize each group very easily. I use an index to assist in assigning group IDs.

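# install.packages('dplyr')  # run once if dplyr is not installed
library(dplyr)

## Create index
df$index <- 1:nrow(df)

## Assign group labels from the leading digit(s) of the row index
df$group <- paste("Group", substr(df$index, 1, 1), sep = " ")
df[df$index <= 100, 'group'] <- "Group 0"
## Subset the right-hand side too, so it lines up with the rows being replaced
over_1k <- df$index > 1000
df$group[over_1k] <- paste("Group", substr(df$index[over_1k], 1, 2), sep = " ")
over_10k <- df$index > 10000
df$group[over_10k] <- paste("Group", substr(df$index[over_10k], 1, 3), sep = " ")

## Get summaries
df <- group_by(df, group)
summaries <- summarise(df, avg = mean(parameterX),
                       minimum = min(parameterX),
                       maximum = max(parameterX),
                       med = median(parameterX),
                       ## Note: mode() returns the storage mode (e.g. "numeric"),
                       ## not the most frequent value
                       Mode = mode(parameterX))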

... and so on.

Hope this helps.

edited May 06 '16 at 15:01

answered May 06 '16 at 13:56

Raphael K

Sorry, I figured it out before you typed that, silly me!! it seems to be working!, But what are the 100 and 1000? How can I use parameters instead of hardcoding? Thx – James Rodriguez May 06 '16 at 14:31
I tried to use parameters and it gave me an error: `df[df$index <= group_size, 'group'] <- "Group 0"` `df[df$index > number_of_groups, 'group'] <- "Group 10"` – James Rodriguez May 06 '16 at 14:33
The way I used `substr()` makes it so you have to hardcode anything less than 100 or more than 1000. Otherwise your groupings will be off. Group 1 will have anything from 0-200 and over 1000. If you want to parameterize it more I'm sure you can design a for loop using substr() that does it pretty easily. What was the error you got? Also, if this answer works for you, hit that upvote button, and if it *really* works for you, hit that checkmark. :) – Raphael K May 06 '16 at 14:43
Actually, the number of rows (dataframe_size) can vary depending on the amount of data in the file, so it would be nice to have that value in particular be a parameter, not hardcoded. I don't know how many rows will be in the data: maybe 1000, maybe 10000, maybe 5000, etc. – James Rodriguez May 06 '16 at 14:49
That'll take care of you up to 100,000 rows. There's probably a better, more complete way. I will think about it. – Raphael K May 06 '16 at 15:03
I tried using 1000 instead of size 100 and 50000 instead of size 1000. Now the group sizes are 1000, except group 2 has 11000, group 10 has 84000, and groups 4, 5, 6 have 11000, instead of consistent sizes. – James Rodriguez May 06 '16 at 15:04
I understood your problem to be you want groups of 100. The code I wrote is based on that. – Raphael K May 06 '16 at 15:07
my mistake, i should have clarified that in my initial message. – James Rodriguez May 06 '16 at 15:31
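
On the parameterization question raised in these comments, one alternative to `substr()` (not the answer's method) is to derive the label from integer division of the row index, so that neither the group size nor the row count is hardcoded. A sketch, where `assign_groups` is a hypothetical helper and the data are made up:

```r
# Hypothetical helper: label each row by its block of `group_size` consecutive rows
assign_groups <- function(n_rows, group_size) {
  labels <- paste("Group", (seq_len(n_rows) - 1) %/% group_size)
  # A factor with levels in order of appearance keeps "Group 10" after "Group 9"
  factor(labels, levels = unique(labels))
}

set.seed(7)
df <- data.frame(parameterX = rnorm(5230))  # made-up data; row count is arbitrary
group_size <- 1000

df$group <- assign_groups(nrow(df), group_size)

# Every group has group_size rows except possibly the last one
sizes <- table(df$group)
sizes
```

The resulting `group` column can then be passed to `group_by()` as in the answer.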
