Splitting a dataframe by group and printing group-specific rows to individual HTML files using pander and rapport

Question

Say I have a tall dataframe with many rows per group, like so:

df <- data.frame(group = factor(rep(c("a","b","c"), each = 5)),
                 v1    = sample(1:100, 15, replace = TRUE),
                 v2    = sample(1:100, 15, replace = TRUE),
                 v3    = sample(1:100, 15, replace = TRUE))

What I want to do is split df into length(levels(df$group)) separate dataframes, e.g.,

df_a <- df[df$group == "a", ]; df_b <- df[df$group == "b", ]; ...

And then print each dataframe in a separate HTML/PDF/DOCX file (probably using Rmarkdown and knitr).
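The splitting step itself can be sketched in base R with `split()`, which returns a named list holding one data frame per level of `group`. This is only a minimal sketch: the `set.seed()` call and the `report_<group>.html` file names are illustrative assumptions, not part of the original question.

```r
# Rebuild the example data (set.seed is added here only for reproducibility)
set.seed(1)
df <- data.frame(group = factor(rep(c("a", "b", "c"), each = 5)),
                 v1    = sample(1:100, 15, replace = TRUE),
                 v2    = sample(1:100, 15, replace = TRUE),
                 v3    = sample(1:100, 15, replace = TRUE))

# One data frame per factor level, returned as a named list
groups <- split(df, df$group)

# Hypothetical output file names, one per group
out_files <- paste0("report_", names(groups), ".html")

# Each element can now be handed to whatever does the printing
for (g in names(groups)) {
  cat("group", g, "has", nrow(groups[[g]]), "rows\n")
}
```

Working from a named list avoids creating `df_a`, `df_b`, ... by hand and scales to any number of groups.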

I want to do this because I have a large dataframe and want to create a personalized report for each group a, b, c, etc. Thanks.

Update (11/18/14)

Following @daroczig's advice in this thread and another thread, I tried to write my own template that simply prints a nicely formatted table of all columns and rows for each group, to substitute for the "correlations" template in the original sapply() call. I want my own template rather than just printing the table (e.g., the answer @Thomas graciously provided) because I'd like to build additional customization into it once the simple printing works. Anyway, I've certainly butchered it:

<!--head
meta:
  title: Sample Report
  author: Nicapyke
  description: This is a demo
  packages: ~
inputs:
- name: eachgroup
  class: character
  standalone: TRUE
  required: TRUE
head-->

### Records received up to present for Group <%= eachgroup %>

<%=
pandoc.table(df[df$group == eachgroup, ])
%>

Then, after saving that as groupreport.rapport in my working directory, I wrote the following R code, modeled after @daroczig's response:

library(rapport)

allgroups <- unique(df$group)

# render one report per group, passing the group name as the template input
for (eachgroup in allgroups) {
  rapport.docx("FILEPATHHERE", eachgroup = eachgroup)
}

I received the error:

Error in openFileInOS(f.out) : File not found!

I'm not sure what happened. I see from the pander documentation that this means it's looking for a system file, but that doesn't mean much to me. Anyway, this error doesn't get at the root of the problem, which is (1) what should go in the inputs section of the custom template's YAML header, and (2) which R code should go in the rapport template versus in the R script.

I realize I may be making a number of errors that reveal my lack of experience with rapport and pander. Thanks for your patience!

N.B.:

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.8       dplyr_0.3.0.2   rapport_0.51    yaml_2.1.13     pander_0.5.1   
[6] plyr_1.8.1      lattice_0.20-29

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      digest_0.6.4   evaluate_0.5.5 formatR_1.0    grid_3.1.2    
 [7] lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3    reshape_0.8.5  stringr_0.6.2 
[13] tools_3.1.2

score 2 · Accepted Answer · answered Nov 13 '14 at 22:21

A slightly off-topic, but still R/markdown one-liner for separate reports with report templates:

> library(rapport)
> sapply(levels(df$group), function(g) rapport.html('correlations', data = df[df$group == g, ], vars = c('v1', 'v2', 'v3')))
Exported to */tmp/RtmpYyRLjf/rapport-correlations-1-0.[md|html]* under 0.683 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-2-0.[md|html]* under 0.888 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-3-0.[md|html]* under 1.063 seconds.
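The same loop-over-groups idea can also be sketched without a report template, using plain pander rather than rapport: build one markdown file per group from `pandoc.table.return()`. This is a hedged sketch, not the approach of this answer; the `set.seed()` call, the temporary output directory and the `group_<name>.md` file names are assumptions made for illustration.

```r
library(pander)

# Example data as in the question (set.seed added for reproducibility)
set.seed(1)
df <- data.frame(group = factor(rep(c("a", "b", "c"), each = 5)),
                 v1    = sample(1:100, 15, replace = TRUE),
                 v2    = sample(1:100, 15, replace = TRUE),
                 v3    = sample(1:100, 15, replace = TRUE))

out_dir <- tempdir()  # assumption: write into a temporary directory

# Write one markdown file per group and collect the file paths
md_files <- sapply(levels(df$group), function(g) {
  header <- paste("### Records for group", g)
  tbl    <- pandoc.table.return(df[df$group == g, ])  # table as a markdown string
  f      <- file.path(out_dir, paste0("group_", g, ".md"))
  writeLines(c(header, "", tbl), f)
  f
})
```

Converting the markdown files to HTML, docx or PDF is a separate step, for example with pandoc on the command line or with pander's `Pandoc.convert()`; check the pander documentation for the exact arguments.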

The rapport package can run (predefined or custom) report templates on any (sub)dataset in markdown, then export it to HTML/docx/PDF/other formats. For a quick demo, I've uploaded the resulting documents:

daroczig

Thanks, @daroczig. This might be more of what I'm looking for. About how much time would you say you invested in learning `markdown` syntax and `rapport` syntax before you could create a simple, custom template? – mcjudd Nov 14 '14 at 17:39

@Nicapyke markdown is pretty easy; the most important things can be learnt for life in 5 minutes. For more complex stuff, you should definitely read [Pandoc's markdown manual](http://johnmacfarlane.net/pandoc/README.html#pandocs-markdown), but there's no need to write that markup by hand, as the [pander package](http://rapporter.github.io/pander/) can generate it automatically from raw R objects. About `rapport`: I am one of the authors, so my opinion is rather one-sided :) But it should not take much longer than learning `markdown`. Feel free to ping me if you get stuck. – daroczig Nov 14 '14 at 19:45
Thanks for the reply, @daroczig. I did a bit more surfing on SO and found a reply you made here: http://stackoverflow.com/questions/25407307/how-to-produce-markdown-document-for-each-row-of-dataframe-in-r?lq=1 I think that's more along the lines of what I'm trying to do (except with several rows per dataframe, of course). I'll implement that and let you know if I have any issues. Thanks again. – mcjudd Nov 14 '14 at 21:41

@Nicapyke well, developing a `rapport` template can be tricky the first few times :( I've created a working example based on your version here: https://gist.github.com/daroczig/8756c059235ed97247d3 Some things to note there: (1) there is no need to call `pander` or `pandoc.table` in the R chunks, as every R object is automatically transformed to markdown, (2) you have to define a label/description for the inputs, (3) use `rapport.data` to pass a `data.frame` to the report template, (4) I've added a plot to the template so that the file-name counter could work, (5) IMHO you have a path issue, use `getwd()` – daroczig Nov 19 '14 at 01:35

score 1 · Answer 2 · answered Nov 13 '14 at 21:59

You can do this with by (or split) and xtable (from the xtable package). Here I create xtable objects of each subset, and then loop over them to print them to file:

library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], file = paste0('df',names(s)[i],'.tex'))

If you use the stargazer package, you can get a nice summary of each dataframe instead of the dataframe itself:

library('stargazer')
s <- split(df, df$group)
# one file name per call: stargazer writes its output to every path given in 'out'
for(g in names(s)) stargazer(s[[g]], out = paste0('df', g, '.tex'))

You should be able to easily include each of these files in, e.g., a PDF report. You could also use HTML markup using either xtable or stargazer.
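For the HTML variant, a minimal sketch with xtable could look like the following. It relies on the `type = "html"` and `file` arguments of `print.xtable()`; the `set.seed()` call, the temporary output directory and the `df<group>.html` file names are assumptions made for illustration.

```r
library(xtable)

# Example data as in the question (set.seed added for reproducibility)
set.seed(1)
df <- data.frame(group = factor(rep(c("a", "b", "c"), each = 5)),
                 v1    = sample(1:100, 15, replace = TRUE),
                 v2    = sample(1:100, 15, replace = TRUE),
                 v3    = sample(1:100, 15, replace = TRUE))

out_dir <- tempdir()  # assumption: write into a temporary directory

# Write one HTML table per group and remember where each file went
html_files <- character(0)
for (g in levels(df$group)) {
  f <- file.path(out_dir, paste0("df", g, ".html"))
  print(xtable(df[df$group == g, ]), type = "html", file = f)
  html_files[g] <- f
}
```

Each file holds the table markup for one group, so it can be opened in a browser or pasted into a larger HTML report.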

Thomas

Thanks for this answer, @Thomas. I'm not familiar with LaTeX. I do have MiKTeX 2.9 installed on my work computer though. Is creating the output from the .tex code as simple as loading the files into MiKTeX and having the interpreter evaluate and print the properly formatted tables? Thanks. – mcjudd Nov 14 '14 at 17:34

LaTeX is what RStudio uses to create a PDF from an Rmd file. MiKTeX is basically a distribution of LaTeX. The above examples will nearly work. A basic LaTeX tutorial would show you how to embed those tables in a simple document. – Thomas Nov 14 '14 at 20:31
Thanks, I'll follow up on that. Also, it's funny how you get focused on a certain aspect of programming and then never become aware of certain very basic functions that you could've used from the start. `split`, here, is the perfect example. `plyr` and `dplyr` have shielded me from base R, for better or for worse. :) – mcjudd Nov 14 '14 at 21:47
Yup, always good to play with the builtin tools because sometimes they're exactly what you need. A lot of R packages are just sugar. – Thomas Nov 14 '14 at 22:27
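The embedding step mentioned in the comments above can be sketched from R as well: write a minimal LaTeX wrapper per group that pulls in the generated table with `\input`. This is a hedged sketch; `make_wrapper()`, the `report_<group>.tex` names and the temporary output directory are hypothetical, while `dfa.tex`, `dfb.tex` and `dfc.tex` are the names produced by the xtable loop in the answer.

```r
# Build a minimal LaTeX document that pulls in one generated table file
make_wrapper <- function(table_file) {
  c("\\documentclass{article}",
    "\\begin{document}",
    paste0("\\input{", table_file, "}"),
    "\\end{document}")
}

# In practice, use the directory that holds dfa.tex, dfb.tex and dfc.tex,
# because \input resolves the file name relative to the wrapper
out_dir <- tempdir()

for (g in c("a", "b", "c")) {
  writeLines(make_wrapper(paste0("df", g, ".tex")),
             file.path(out_dir, paste0("report_", g, ".tex")))
}
```

Each wrapper can then be compiled with a LaTeX installation such as MiKTeX, for example by running `pdflatex report_a.tex` or, from R, `tools::texi2pdf("report_a.tex")`.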
