
I'm trying to use dput() to create a reproducible example with a large data frame. The data frame needs to be large because the reproducible example involves moving averages. The way I've found to do this uses the function reproduce, shared by @Ricardo Saporta in "How to make a great R reproducible example?". reproduce is based on dput() (code here: https://github.com/rsaporta/pubR/blob/gitbranch/reproduce.R).

library(data.table) 
library(devtools)
source_url("https://raw.github.com/rsaporta/pubR/gitbranch/reproduce.R")

data <- read.table("http://pastebin.com/raw/xP1Zd0sC")
setDF(data)
reproduce(data, rows = c(1:100))

That code creates the data frame data and then prints dput() output for it. I pass the rows argument so that the full data frame is output. Yet if I use that output to recreate the data frame, it fails.

Assigning the dput() output to a new data frame fails at first because the code is incomplete: I have to add three closing parentheses manually at the end. Even after doing so, I get the following error message: "Error in View : arguments imply differing number of rows: 100, 61".

Please note that the dput() output from reproduce without the rows = c(1:100) argument works fine. It just does not output the full data frame, only a sample of it.

#This works fine
reproduce(data)

Please also note that I used the pastebin method to create this reproducible example. That method does not replace the dput() method for my purposes, because it fails when importing data in which some column values contain spaces (e.g. data frames with datetime stamps).

EDIT: After some further troubleshooting, I discovered that reproduce fails as described above when the rows argument is used with a data frame containing 4 or more columns. I will have to find an alternative.

If anyone is interested in testing this, run the code above with the following links, which point to data sets with different numbers of columns:

1) 100x5: http://pastebin.com/raw/xP1Zd0sC

2) 100x4: http://pastebin.com/raw/YZtetfne

3) 100x4: http://pastebin.com/raw/63Ap2bh5

4) 100x3: http://pastebin.com/raw/1vMMcMtx

5) 100x3: http://pastebin.com/raw/ziM1bYQt

6) 100x1: http://pastebin.com/raw/qxtQs5u4

Krug
  • Trying to produce a reproducible example for an SO question. – Krug May 01 '16 at 18:42
  • Same problem using `rows=100`. – Krug May 01 '16 at 18:44
  • Thanks Richard. `dput()` doesn't let me specify the number of rows. At least there is no mention of it in its documentation. Its output is a few rows at the beginning and a few at the end. I need to output the whole data frame, rather than a sample. – Krug May 01 '16 at 19:21
  • Awesome! Want to put that as answer? I would have saved a lot of time by directly asking how to get `dput()` to output a full database. – Krug May 01 '16 at 19:24

1 Answer


If you are just trying to dput() the first 100 rows of a data set, then you can simply subset the data just prior to running dput(). There doesn't seem to be a need to use the linked function.

dput(droplevels(head(data, 100)))  ## or dput(droplevels(data[1:100,]))

should do it.
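To check that the pasted output really rebuilds the data, the round trip can be sketched as follows. This is a minimal sketch under assumptions: `df`, its columns, `first100`, `txt`, and `restored` are made-up names standing in for the real data, which is not reproduced here. The `stamp` column deliberately holds values with spaces, the case that breaks the pastebin approach.

```r
# Hypothetical stand-in for the real data: 150 rows, one column whose values contain spaces
set.seed(1)
df <- data.frame(
  stamp = format(as.POSIXct("2016-05-01 00:00:00", tz = "UTC") + 60 * (0:149),
                 "%Y-%m-%d %H:%M:%S"),
  group = factor(rep(c("a", "b", "c"), each = 50)),
  value = round(rnorm(150), 3),
  stringsAsFactors = FALSE
)

# Subset first, then drop factor levels that no longer occur
first100 <- droplevels(head(df, 100))

# dput() prints R code; capture it as text, as it would be pasted into a question
txt <- paste(capture.output(dput(first100)), collapse = "\n")

# Rebuild the object from the pasted text and compare it with the original subset
restored <- eval(parse(text = txt))
nrow(restored)
all.equal(first100, restored)
```

In a question you would paste the `dput()` output directly after `data <-`; `eval(parse(text = txt))` only simulates that step inside one script.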

It is, however, peculiar that your attempt with reproduce() did not work. I would file an issue on the GitHub page for that. You will likely get a more constructive answer there.

Thanks to @David Arenburg for reminding me about droplevels(). It is useful on this operation if we have factor columns. "Leftover" levels will be dropped.
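The effect of droplevels() on a subset can be seen in a small sketch; the data frame `f` and its columns are invented for illustration.

```r
# Hypothetical data frame with a three-level factor column
f <- data.frame(id = 1:6, grp = factor(c("a", "a", "b", "b", "c", "c")))

# Keep only the rows that contain "a" and "b"
sub <- head(f, 4)

levels(sub$grp)              # "c" is still stored as a level
levels(droplevels(sub)$grp)  # the unused level is gone

# The dput() text is shorter once the leftover level is dropped
n_with    <- nchar(paste(capture.output(dput(sub)), collapse = ""))
n_without <- nchar(paste(capture.output(dput(droplevels(sub))), collapse = ""))
n_without < n_with
```

With only a few levels the saving is small, but for a factor with many levels a short subset can otherwise carry the full level list into the pasted output.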

Rich Scriven
  • I now realize a benefit of `reproduce` vs `dput()` is that the former outputs everything on one line, while `dput()` output requires lots of editing before sharing on SO. It's a pity `reproduce` has that little bug. – Krug May 01 '16 at 19:48