48

I miss a way to add data to an SO answer in a transparent manner. My experience is that the structure object from dput() at times confuses inexperienced users unnecessary. I do however not have the patience to copy/paste it into a simple data frame each time and would like to automate it. Something similar to dput(), but in a simplified version.

Say I by copy/pasting and some other hos have data like this,

Df <- data.frame(A = c(2, 2, 2, 6, 7, 8),
                 B = c("A", "G", "N", NA, "L", "L"),
                 C = c(1L, 3L, 5L, NA, NA, NA))

looks like this,

Df
#>   A    B  C
#> 1 2    A  1
#> 2 2    G  3
#> 3 2    N  5
#> 4 6 <NA> NA
#> 5 7    L NA
#> 6 8    L NA

Within one integer, one factor and one numeric vector,

str(Df)
#> 'data.frame':    6 obs. of  3 variables:
#>  $ A: num  2 2 2 6 7 8
#>  $ B: Factor w/ 4 levels "A","G","L","N": 1 2 4 NA 3 3
#>  $ C: int  1 3 5 NA NA NA

Now, I would like to share this on SO, but I do not always have the orginal data frame it came from. More often than not I pipe() it in form SO and the only way I know to get it out is dput(). Like,

dput(Df)
#> structure(list(A = c(2, 2, 2, 6, 7, 8), B = structure(c(1L, 2L, 
#> 4L, NA, 3L, 3L), .Label = c("A", "G", "L", "N"), class = "factor"), 
#> C = c(1L, 3L, 5L, NA, NA, NA)), .Names = c("A", "B", "C"), row.names = c(NA, 
#> -6L), class = "data.frame")

but, as I said at the top, these structures can look quite confusing. For that reason I am looking for a way to compress dput()'s output in some way. I imagine an output that looks something like this,

dput_small(Df)
#> data.frame(A = c(2, 2, 2, 6, 7, 8), B = c("A", "G", "N", NA, "L", "L"),
#> C = c(1L, 3L, 5L, NA, NA, NA))

Is that possible? I realize there's other classes, like lists, tbl, tbl_df, etc.

Scarabee
  • 5,437
  • 5
  • 29
  • 55
Eric Fail
  • 8,191
  • 8
  • 72
  • 128

7 Answers7

39

Edit: leaving the older solution at the bottom because it got a bounty and many votes but proposing an improved answer

You can use the {constructive} package :

# install.packages("constructive")
# or dev version:
# remotes::install_github("cynkra/constructive")
Df <- data.frame(A = c(2, 2, 2, 6, 7, 8),
                 B = c("A", "G", "N", NA, "L", "L"),
                 C = c(1L, 3L, 5L, NA, NA, NA))

constructive::construct(Df)
#> data.frame(
#>   A = c(2, 2, 2, 6, 7, 8),
#>   B = c("A", "G", "N", NA, "L", "L"),
#>   C = c(1L, 3L, 5L, NA, NA, NA)
#> )

It has custom constructors to many common classes so it should be able to reproduce most objects faithfully in a human readable way.


Old solution:

3 solutions :

  • a wrapper around dput (handles standard data.frames, tibbles and lists)

  • a read.table solution (for data.frames)

  • a tibble::tribble solution (for data.frames, returning a tibble)

All include n and random parameter which allow one to dput only the head of the data or sample it on the fly.

dput_small1(Df)
# Df <- data.frame(
#   A = c(2, 2, 2, 6, 7, 8),
#   B = structure(c(1L, 2L, 4L, NA, 3L, 3L), .Label = c("A", "G", "L", 
#     "N"), class = "factor"),
#   C = c(1L, 3L, 5L, NA, NA, NA) ,
#   stringsAsFactors=FALSE)

dput_small2(Df,stringsAsFactors=TRUE)
# Df <- read.table(sep="\t", text="
#   A   B   C
#   2   A    1
#   2   G    3
#   2   N    5
#   6   NA  NA
#   7   L   NA
#   8   L   NA", header=TRUE, stringsAsFactors=TRUE)

dput_small3(Df)
# Df <- tibble::tribble(
#   ~A, ~B, ~C,
#   2,           "A",          1L,
#   2,           "G",          3L,
#   2,           "N",          5L,
#   6, NA_character_, NA_integer_,
#   7,           "L", NA_integer_,
#   8,           "L", NA_integer_
# )
# Df$B <- factor(Df$B)

Wrapper around dput

This option that gives an output very close to the one proposed in the question. It's quite general because it's actually wrapped around dput, but applied separately on columns.

multiline means 'keep dput's default output laid out into multiple lines'.

dput_small1<- function(x,
                       name=as.character(substitute(x)),
                       multiline = TRUE,
                       n=if ('list' %in% class(x)) length(x) else nrow(x),
                       random=FALSE,
                       seed = 1){
  name
  if('tbl_df' %in% class(x)) create_fun <- "tibble::tibble" else
    if('list' %in% class(x)) create_fun <- "list" else
      if('data.table' %in% class(x)) create_fun <- "data.table::data.table" else
        create_fun <- "data.frame"
    
    if(random) {
      set.seed(seed)
      if(create_fun == "list") x <- x[sample(1:length(x),n)] else 
        x <- x[sample(1:nrow(x),n),]
    } else {
      x <- head(x,n)
    }
    
    line_sep <- if (multiline) "\n    " else ""
    cat(sep='',name," <- ",create_fun,"(\n  ",
        paste0(unlist(
          Map(function(item,nm) paste0(nm,if(nm=="") "" else " = ",paste(capture.output(dput(item)),collapse=line_sep)),
              x,if(is.null(names(x))) rep("",length(x)) else names(x))),
          collapse=",\n  "),
        if(create_fun == "data.frame") ",\n  stringsAsFactors = FALSE)" else "\n)")
}

dput_small1(list(1,2,c=3,d=4),"my_list",random=TRUE,n=3)
# my_list <- list(
#   2,
#   d = 4,
#   c = 3
# )

read.table solution

For data.frames I find it comfortable however to have the input in a more explicit/tabular format.

This can be reached using read.table, then reformatting automatically the type of columns that read.table wouldn't get right. Not as general as first solution but will work smoothly for 95% of the cases found on SO.

dput_small2 <- function(df,
                        name=as.character(substitute(df)),
                        sep='\t',
                        header=TRUE,
                        stringsAsFactors = FALSE,
                        n= nrow(df),
                        random=FALSE,
                        seed = 1){
    name
    if(random) {
      set.seed(seed)
      df <- df[sample(1:nrow(df),n),]
    } else {
      df <- head(df,n)
    }
  cat(sep='',name,' <- read.table(sep="',sub('\t','\\\\t',sep),'", text="\n  ',
      paste(colnames(df),collapse=sep))
  df <- head(df,n)
  apply(df,1,function(x) cat(sep='','\n  ',paste(x,collapse=sep)))
  cat(sep='','", header=',header,', stringsAsFactors=',stringsAsFactors,')')
  
  sapply(names(df), function(x){
    if(is.character(df[[x]]) & suppressWarnings(identical(as.character(as.numeric(df[[x]])),df[[x]]))){ # if it's a character column containing numbers
      cat(sep='','\n',name,'$',x,' <- as.character(', name,'$',x,')')
    } else if(is.factor(df[[x]]) & !stringsAsFactors) { # if it's a factor and conversion is not automated
      cat(sep='','\n',name,'$',x,' <- factor(', name,'$',x,')')
    } else if(inherits(df[[x]], "POSIXct")){
      cat(sep='','\n',name,'$',x,' <- as.POSIXct(', name,'$',x,')')
    } else if(inherits(df[[x]], "Date")){
      cat(sep='','\n',name,'$',x,' <- as.Date(', name,'$',x,')')
    }})
  invisible(NULL)
}

Simplest case

dput_small2(iris,n=6)

will print:

iris <- read.table(sep="\t", text="
  Sepal.Length  Sepal.Width Petal.Length    Petal.Width Species
  5.1   3.5 1.4 0.2  setosa
  4.9   3.0 1.4 0.2  setosa
  4.7   3.2 1.3 0.2  setosa
  4.6   3.1 1.5 0.2  setosa
  5.0   3.6 1.4 0.2  setosa
  5.4   3.9 1.7 0.4  setosa", header=TRUE, stringsAsFactors=FALSE)

which in turn when executed will return :

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

str(iris)
# 'data.frame': 6 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
# $ Species     : chr  " setosa" " setosa" " setosa" " setosa" ...

more complex

dummy data:

test <- data.frame(a=1:5,
                   b=as.character(6:10),
                   c=letters[1:5],
                   d=factor(letters[6:10]),
                   e=Sys.time()+(1:5),
                   stringsAsFactors = FALSE)

This:

dput_small2(test,'df2')

will print:

df2 <- read.table(sep="\t", text="
  a b   c   d   e
  1 6   a   f   2018-02-15 11:53:17
  2 7   b   g   2018-02-15 11:53:18
  3 8   c   h   2018-02-15 11:53:19
  4 9   d   i   2018-02-15 11:53:20
  5 10  e   j   2018-02-15 11:53:21", header=TRUE, stringsAsFactors=FALSE)
df2$b <- as.character(df2$b)
df2$d <- factor(df2$d)
df2$e <- as.POSIXct(df2$e)

which in turn when executed will return :

#   a  b c d                   e
# 1 1  6 a f 2018-02-15 11:53:17
# 2 2  7 b g 2018-02-15 11:53:18
# 3 3  8 c h 2018-02-15 11:53:19
# 4 4  9 d i 2018-02-15 11:53:20
# 5 5 10 e j 2018-02-15 11:53:21

str(df2)    
# 'data.frame': 5 obs. of  5 variables:
# $ a: int  1 2 3 4 5
# $ b: chr  "6" "7" "8" "9" ...
# $ c: chr  "a" "b" "c" "d" ...
# $ d: Factor w/ 5 levels "f","g","h","i",..: 1 2 3 4 5
# $ e: POSIXct, format: "2018-02-15 11:53:17" "2018-02-15 11:53:18" "2018-02-15 11:53:19" "2018-02-15 11:53:20" ...

all.equal(df2,test)
# [1] "Component “e”: Mean absolute difference: 0.4574251" # only some rounding error

tribble solution

The read.table option is very readable but not very general. with tribble pretty much any data type can be handled (though factors need adhoc fixing).

This solution isn't so useful for OP's example but is great for list columns (see example below). To make use of the output, library tibble is required.

Just as my first solution, it's a wrapper around dput, but instead of 'dputting' columns, i'm 'dputting' elements.

dput_small3 <- function(df,
                        name=as.character(substitute(df)),
                        n= nrow(df),
                        random=FALSE,
                        seed = 1){
  name
  if(random) {
    set.seed(seed)
    df <- df[sample(1:nrow(df),n),]
  } else {
    df <- head(df,n)
  }
  df1 <- lapply(df,function(col) if(is.factor(col)) as.character(col) else col)
  dputs   <- sapply(df1,function(col){
    col_dputs <- sapply(col,function(elt) paste(capture.output(dput(elt)),collapse=""))
    max_char <- max(nchar(unlist(col_dputs)))
    sapply(col_dputs,function(elt) paste(c(rep(" ",max_char-nchar(elt)),elt),collapse=""))
  })
  lines   <- paste(apply(dputs,1,paste,collapse=", "),collapse=",\n  ")
  output  <- paste0(name," <- tibble::tribble(\n  ",
                    paste0("~",names(df),collapse=", "),
                    ",\n  ",lines,"\n)")
  cat(output)
  sapply(names(df), function(x) if(is.factor(df[[x]])) cat(sep='','\n',name,'$',x,' <- factor(', name,'$',x,')'))
  invisible(NULL)
}

dput_small3(dplyr::starwars[c(1:3,11)],"sw",n=6,random=TRUE)
# sw <- tibble::tribble(
#   ~name, ~height, ~mass, ~films,
#   "Lando Calrissian", 177L,       79,                     c("Return of the Jedi", "The Empire Strikes Back"),
#      "Finis Valorum", 170L, NA_real_,                                                   "The Phantom Menace",
#       "Ki-Adi-Mundi", 198L,       82, c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith"),
#           "Grievous", 216L,      159,                                                  "Revenge of the Sith",
#     "Wedge Antilles", 170L,       77,       c("Return of the Jedi", "The Empire Strikes Back", "A New Hope"),
#         "Wat Tambor", 193L,       48,                                                 "Attack of the Clones"
# )
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • 2
    Nice solutions... It would be nice to have also the possibility to take a random sample of the lines rather than the first n lines (for example for the iris dataset...). Easy to do for example with a `sample = NULL` function argument and then `if(!is.null(sample)) { df <- df[sample(1:nrow(df), sample),] } else { df <- head(df,n) }` In the function. – Gilles San Martin Feb 16 '18 at 21:56
  • 1
    Done, I implemented the `head` and `sample` features for both solutions, and made the first solution handle `lists` and `tibbles`. – moodymudskipper Feb 17 '18 at 19:07
  • I feel like this neat code belong to some package in the [tag:tidyverse]. Maybe [David Robinson](https://stackoverflow.com/users/712603/david-robinson) would be up for adding this to his [stackr](https://github.com/dgrtwo/stackr) package or something like it. – Eric Fail Feb 18 '18 at 10:17
  • `tidyverse` made me think of `tibble::tribble` so I added a third solution, and support for `data.table` in the first one – moodymudskipper Feb 18 '18 at 13:00
22

The package datapasta won't always work perfectly as it currently doesn't support all types, but it is clean and easy, i.e.,

# install.packages(c("datapasta"), dependencies = TRUE)    
datapasta::dpasta(Df)
#> data.frame(
#>            A = c(2, 2, 2, 6, 7, 8),
#>            C = c(1L, 3L, 5L, NA, NA, NA),
#>            B = as.factor(c("A", "G", "N", NA, "L", "L"))
#> )
Eric Fail
  • 8,191
  • 8
  • 72
  • 128
MilesMcBain
  • 1,115
  • 10
  • 12
11

We could set control to NULL to simplify:

dput(Df, control = NULL)
# list(A = c(2, 2, 2, 6, 7, 8), B = c(NA, NA, NA, NA, 7, 9), C = c(1, 3, 5, NA, NA, NA))

Then wrap it with data.frame:

data.frame(dput(Df, control = NULL))

Edit: To avoid factor columns getting converted to numbers, we could convert them to character before calling dput:

dput_small <- function(d){
  ix <- sapply(d, is.factor)
  d[ix] <- lapply(d[ix], as.character)
  dput(d, control = NULL)
  }
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • Interesting. I did look at this. This was actually what made me add a factor to the object. I apologize for not doing that from the start. – Eric Fail Feb 15 '18 at 08:07
  • How do you share data you have in R on SO? Accept `dput()`'s output at looks a bit clunky? – Eric Fail Feb 15 '18 at 08:08
  • 1
    @EricFail If small data, then use dput, if bigger I use `df1 <- read.table(text = "my delimited data")` but with read.table you will lose the attributes, so need to check the output if same as intended. – zx8754 Feb 15 '18 at 08:10
  • @EricFail see edit, also there is a [reprex](http://reprex.tidyverse.org/) package. – zx8754 Feb 15 '18 at 08:23
  • Very nice indeed. I wasen't aware of the [tag:reprex] package. It should have a function like this. – Eric Fail Feb 15 '18 at 08:27
  • @EricFail you could [submit issue at GitHub](https://github.com/tidyverse/reprex/issues) – zx8754 Feb 15 '18 at 08:28
  • 1
    Good point. I will see what comes up here in the next days and think about submitting something to [tag:reprex]. I imagine a [tag:reprex] function also would have to be compatible with `tbl`, `tbl_df`, `grouped_df` and maybe more. – Eric Fail Feb 15 '18 at 08:32
  • The tidyverse website advises using `dput` with `head` or `sample` http://reprex.tidyverse.org/articles/reprex.html – moodymudskipper Feb 16 '18 at 00:05
  • Really cool and simple solution ! You could add `cat("d <- as.data.frame(")` before `dput(d, control = NULL)` and `cat(")")` just after this line to make the output on the console copy/paste ready ! – Gilles San Martin Feb 16 '18 at 21:16
10

You could simply write to a compressed connection.

gz <- gzfile("foo.gz", open="wt")
dput(Df, gz)
close(gz)
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
  • I'm not sure I understand this answer. Could you possibly show what kinda output this provide? – Eric Fail Feb 17 '18 at 19:11
  • Your original question (nearly 5 years ago!) said, "I would like `dput()` to compress the output in some way." My answer compresses the output using standard `gzip` compression. It was not clear that by "compress in some way", you meant, "change the text representation to be more understandable by humans." – Joshua Ulrich Feb 18 '18 at 00:38
  • Good point! Excellent point! It was after thinking about the original wording that I decided to reword it and put a bounty, I think five years ago I was too intimidated to ask for clarification from you. Regardless, what I meant to say (also then) was that I am looking for a clearer output to share structures on SO (and elsewhere). Thanks you for your feedback! – Eric Fail Feb 18 '18 at 10:10
  • I apologize fro the unclear wording when I initially posted this. – Eric Fail Feb 18 '18 at 10:18
3

Generally a large dput is difficult to cope with, on SO or otherwise. Instead you can just save the structure directly to an Rda file:

save(Df, file='foo.Rda')

And read it back in:

load('foo.Rda')

See this question for a little more info and credit where credit is due: How to save a data.frame in R?

You could also look at the sink function...

If I've missed the purpose of your question, please feel free to expand on the reasons why dput is the only mechanism for you.

Community
  • 1
  • 1
Justin
  • 42,475
  • 9
  • 93
  • 111
3

It might be worth mentioning memCompress and memDecompress here. For in-memory objects, it can reduce the size of large objects by compressing them as specified. And the latter reverses the compression. They're actually quite useful for package objects.

sum(nchar(dput(DF)))
# [1] 64
( mDF <- memCompress(as.character(DF)) )
# [1] 78 9c 4b d6 30 d2 51 80 20 33 1d 05 73 1d 05 0b 4d ae 64 0d 3f 47 1d 05 64 0c 14 b7 04 89 1b ea 28 18 eb 28 98 22 4b 6a 02 00 a8 ba 0c d2
length(mDF)
# [1] 46
cat(mdDF <- memDecompress(mDF, "gzip", TRUE))
# c(2, 2, 2, 6, 7, 8)
# c(NA, NA, NA, NA, 7, 9)
# c(1, 3, 5, NA, NA, NA)
nchar(mdDF)
# [1] 66

I haven't quite determined if the data frame can be reassembled easily, but I'm sure it can be.

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • Thanks, interesting. I hope you realized I asked this question in Sep '13. However, I do appreciate your response. – Eric Fail Aug 02 '14 at 05:40
  • I did. I came across this post while searching for something else and it's a good question. Plus I've been using `memCompress` with some package data so I thought I'd share. – Rich Scriven Aug 02 '14 at 05:43
  • I appreciate you took the time to share and thanks for the nice words. – Eric Fail Aug 02 '14 at 06:07
3

There is also the read.so package, which I really like, in particular to read SO data. It works for tibbles as well.

#devtools::install_github("alistaire47/read.so")
Df <- data.frame(A = c(2, 2, 2, 6, 7, 8),
                 B = c("A", "G", "N", NA, "L", "L"),
                 C = c(1L, 3L, 5L, NA, NA, NA))

read.so::write.so(Df)

#> Df <- data.frame(
#>   A = c(2, 2, 2, 6, 7, 8),
#>   B = c("A", "G", "N", NA, "L", "L"),
#>   C = c(1L, 3L, 5L, NA, NA, NA)
#> )
tjebo
  • 21,977
  • 7
  • 58
  • 94