3

I'm revising my question in full. I realized it was long, and my point was getting lost.

Here's what I need to do:

Create automatic reports for schools that contain tables that compare their data to the district that school is in as well as the whole state. The State is the entire data set.

Here's what I understand:

How to create the automated loop that transverses through the data and creates a unique PDF report for each school. This post was very helpful in setting up a framework to generate reports.

Here's what I need help with:

I need a table that contains the following columns: School, District, State I also need the first column of the table to contain a row for: Sample Size, Mean, Standard Deviation.

I'm trying to create this in the context of a for loop because I need a unique table in each unique pdf that is created. If there is a better approach I'd love to hear about it.

Anyways, here is a reproducible example that I tested. I haven't gotten far in creating the table.

Any help would be very appreciated.

driver.r:

# Create dataset
set.seed(500)
School <- rep(seq(1:20), 2)
District <- rep(c(rep("East", 10), rep("West", 10)), 2)
Score <- rnorm(40, 100, 15)
Student.ID <- sample(1:1000,8,replace=T)
school.data <- data.frame(School, District, Score, Student.ID)

#prepare for multicore processing 
require(parallel)
# generate the rmd files, one for each school in df
library(knitr)
mclapply(unique(school.data$School), function(x) 
  knit("F:/sample-auto/auto.Rmd", 
       output=paste('report_', x, '.Rmd', sep="")))

# generate PDFs from the rmd files, one for each school in df
mclapply(unique(school.data$School), function(x)
  rmarkdown::render(paste0("F:/sample-auto/", paste0('report_', x, '.Rmd'))))

auto.Rmd:

---
title: "Automated Report Generation for Data"
author: "ME"
date: "February 5, 2015"
output: 
  pdf_document:
  toc: true
  number_sections: true
---

```{r, echo=FALSE}
library(xtable)
library(plyr)
df <- data.frame(school.data)
subgroup <- df[school.data$School == x,]
```

# Start of attempt 

```{r results='asis', echo=FALSE}
 for(school in unique(subgroup$School))
{
subgroup2 <- subgroup[subgroup$School == school,]
savename <- paste(x, school)
df2<- mean(subgroup2$Score, na.rm=TRUE)
df2 <- data.frame(df2)
print(xtable(df2))
}
```

I also tried replacing the loop with:

```{r results='asis', echo=FALSE}
df2 <- ddply(school.data, .(School), summarise, n = length(School), mean =      
mean(Score), sd = sd(Score))
print(xtable(df2))
```

This gives me what I don't want in that all of the schools get the data for every school vs. just their school.

Community
  • 1
  • 1
bfoste01
  • 337
  • 1
  • 2
  • 14
  • 1
    the problem with attempt 1 is that you do not use the iteration variable `school` in the loop, so it will do the exact same thing for each iteration. maybe you wanted `mean(subgroup[subgroup$School == school, 'Score'])`? attempt 2 looks fine. – rawr Feb 06 '15 at 02:10
  • I totally rebooted the question because I don't think I was clear with what I needed. Thanks. – bfoste01 Feb 06 '15 at 17:56
  • 1
    each school has only one observation, is that intentional? so the sample size will be 1 (school), the mean will be whatever the value is, the SD will not be evaluable. If you are using plyr, you don't need to create separate tables for all schools in unique(school); `ddply(school.data, .(school), summarise, n = length(school), mean = mean(score), sd = sd(score), etc)` – rawr Feb 06 '15 at 18:22
  • @rawr: Thanks for catching that. I fixed it, and there should be 2 observations for each school.The data set I'm trying to get this to work with is over 100K students, so I hope this simplified version gets at what I'm dealing with. Does that impact your answer? I just tried playing with your code, but it's not yielding anything different in the pdfs that are generated for each school, unless it was how I was plugging it in. Does the code take the place of the loop entirely? Thanks! – bfoste01 Feb 06 '15 at 18:50
  • Also, each school is only to get their data, their district's data and the state's data back, so they shouldn't see anything else in the table, which is why I thought I needed separate tables. – bfoste01 Feb 06 '15 at 18:53

1 Answers1

2

If you are using a loop to subset the data before passing it to the .rmd file, you don't really need plyr or ddply to do the split/apply/combine for you. Since you have a lot of observations, it may be noticeable overhead.

Also, if you are creating the subgroups before running the .rmd, you don't need the loop inside the file, either. You just need to make a data frame with the stats you want and use xtable

---
title: "Automated Report Generation for Data"
author: "ME"
date: "February 5, 2015"
output: 
  pdf_document:
    toc: true
    number_sections: true
---

```{r, echo=FALSE}
library(xtable)
library(plyr)
# Create dataset
set.seed(500)
School <- rep(seq(1:20), 2)
District <- rep(c(rep("East", 10), rep("West", 10)), 2)
Score <- rnorm(40, 100, 15)
Student.ID <- sample(1:1000,8,replace=T)
school.data <- data.frame(School, District, Score, Student.ID)


x <- unique(school.data$School)[1]
subgroup <- school.data[school.data$School == x, ]
```

# Start of attempt 

```{r results='asis', echo=FALSE}
options(xtable.comment = FALSE)
## for one school, it is redundant to split based on school but works
## likewise, it is redundant to have a loop here to split based on school
## if you have already used a loop to create the subgroup data 
res <- ddply(subgroup, .(School), summarise,
             n = length(School),
             mean = mean(Score),
             SD = sd(Score),
             VAR = var(Score))
xtable(res)

## if you pass in the entire data frame you will get all schools
## then you can subset the one you want
res <- ddply(school.data, .(School), summarise,
             n = length(School),
             mean = mean(Score),
             SD = sd(Score),
             VAR = var(Score))

xtable(res[res$School %in% x, ])
```

enter image description here

rawr
  • 20,481
  • 4
  • 44
  • 78
  • Any advice on how to isolate the mean value for the entire district? I've used aggregate, with coupled with tapply, and ddply to no avail. All methods give me the results for both districts. When I try to select out with `xtable(test[test$School %in% x, ])` I get blanks. – bfoste01 Feb 10 '15 at 14:31
  • 1
    you'd have to regroup by district. using ddply, `ddply(school.data, .(District), summarise, mean = mean(score))` – rawr Feb 10 '15 at 15:13
  • Thanks, I had already tried that. The issue is that I want to loop through and pull out the mean for the specific district a specific school is in. I know how to use ddply to select a specific district with the aggregated data frame , but with 100s of schools that won't work for the automated reports. I tried `xtable(df[df$School %in% x, ])` with ddply to try and grab the district, much like you showed me previously, but it didn't work. That's what I'm trying to work out. Ultimately, each table has the statistics for the school the district it is in and the state. Thanks! – bfoste01 Feb 10 '15 at 15:53
  • 1
    what about if you add a column to `school.data` with the means per district, so that even after you subset, that column will still be there? like `school.data$dist_mean <- with(school.data, ave(Score, District, FUN = mean))` – rawr Feb 10 '15 at 20:30
  • I think that's going to do it! Thanks rawr!! – bfoste01 Feb 11 '15 at 00:49