5

I want to create a data frame from the output of the describe() function. The dataset under consideration is iris. The data frame should look like this:

    Variable       n    missing  unique  Info  Mean   0.05   0.1  0.25  0.5   0.75  0.9   0.95
    Sepal.Length   150  0        35      1     5.843  4.6    4.8  5.1   5.8   6.4   6.9   7.255
    Sepal.Width    150  0        23      0.99  3.057  2.345  2.5  2.8   3     3.3   3.61  3.8
    Petal.Length   150  0        43      1     3.758  1.3    1.4  1.6   4.35  5.1   5.8   6.1
    Petal.Width    150  0        22      0.99  1.199  0.2    0.2  0.3   1.3   1.8   2.2   2.3
    Species        150  0        3

Is there a way to coerce the output of describe() to a data.frame? When I try, I get the error shown below:

library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
# Error in as.data.frame.default(first_vec) : cannot coerce class "describe" to a data.frame

Thanks

skumar
  • You should modify the code for `describe.vector` and alter it so that it produces numeric output of a constant length. – IRTFM Jun 19 '16 at 15:40
  • @akrun - the table in my post is expected output. Thank you for sharing your inputs. – skumar Jun 19 '16 at 17:17

3 Answers

7

The way to figure this out is to examine the objects with str():

data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris 
# 
# 5  Variables      150  Observations
# -------------------------------------------------------------
# Sepal.Length 
#       n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95 
#     150       0      35       1   5.843   4.600   4.800   5.100   5.800   6.400   6.900   7.255
# 
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9 
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species 
#       n missing  unique 
#     150       0       3 
# 
# setosa (50, 33%), versicolor (50, 33%) 
# virginica (50, 33%) 
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript    : chr "Sepal.Length"
# ..$ units       : NULL
# ..$ format      : NULL
# ..$ counts      : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values      : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species     :List of 5
# ..$ descript: chr "Species"
# ..$ units   : NULL
# ..$ format  : NULL
# ..$ counts  : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values  : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"

We see that di is a list of lists. We can take it apart by looking at just the first sublist. You can convert that into a vector:

unlist(di[[1]])
#             descript              counts.n 
#       "Sepal.Length"                 "150" 
#       counts.missing         counts.unique 
#                  "0"                  "35" 
#          counts.Info           counts.Mean 
#                  "1"               "5.843" 
#           counts..05            counts..10 
#              "4.600"               "4.800" 
#           counts..25            counts..50 
#              "5.100"               "5.800" 
#           counts..75            counts..90 
#              "6.400"               "6.900" 
#           counts..95   intervalFreq.range1 
#              "7.255"                 "4.3" 
#  intervalFreq.range2   intervalFreq.count1 
#                "7.9"                   "1" 
#  ...
#            values.H3             values.H2 
#                "7.6"                 "7.7" 
#            values.H1 
#                 "7.9" 
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...

It is very long (125 elements). The elements have all been coerced to the same (most inclusive) type, namely character. It seems you want the 2nd through 13th elements, which cover n through the .95 quantile:

unlist(di[[1]])[2:13]
#     counts.n counts.missing  counts.unique    counts.Info 
#        "150"            "0"           "35"            "1" 
#  counts.Mean     counts..05     counts..10     counts..25 
#      "5.843"        "4.600"        "4.800"        "5.100" 
#   counts..50     counts..75     counts..90     counts..95 
#      "5.800"        "6.400"        "6.900"        "7.255" 

Now you have something you can start to work with. But notice that this layout only holds for the numeric variables; the factor variable Species is different:

unlist(di[[5]])
#     descript       counts.n counts.missing  counts.unique 
#    "Species"          "150"            "0"            "3" 
#      values1        values2        values3        values4 
#         "50"           "33"           "50"           "33" 
#      values5        values6 
#         "50"           "33" 

In that case, it seems you only want elements two through four.

Using this process of discovery and problem solving, you can see how you'd take the output of describe apart and put the information you want into a data frame. However, this will take a lot of work. You'll presumably need to use loops and lots of if(){ ... } else{ ... } blocks. You might just want to code your own dataset description function from scratch.
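As a rough illustration of the "code it from scratch" route, here is a minimal base-R sketch. The helper name describe_column is hypothetical and not part of Hmisc, and Hmisc's Info statistic is deliberately left out because it is specific to that package. The quantiles come from quantile() with its default settings, so they may not match describe() in every case.

```r
# Hypothetical helper (not part of Hmisc): summarise one column
# as a one-row data frame.
describe_column <- function(x, name,
                            probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)) {
  out <- data.frame(
    Variable = name,
    n        = sum(!is.na(x)),                 # non-missing observations
    missing  = sum(is.na(x)),                  # missing observations
    unique   = length(unique(x[!is.na(x)])),   # distinct non-missing values
    stringsAsFactors = FALSE
  )
  if (is.numeric(x)) {
    out$Mean <- mean(x, na.rm = TRUE)
    q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  } else {
    # Factors and characters have no mean or quantiles; fill with NA
    out$Mean <- NA_real_
    q <- rep(NA_real_, length(probs))
  }
  # One column per requested quantile, named "0.05", "0.1", ...
  out[as.character(probs)] <- as.list(q)
  out
}

# Apply the helper to every column of iris and stack the rows
describe_df <- do.call(rbind, Map(describe_column, iris, names(iris)))
rownames(describe_df) <- NULL
describe_df
```

Because every column yields a row of the same shape, the numeric and factor cases can be stacked with a plain rbind, and the statistics stay numeric instead of character.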

gung - Reinstate Monica
  • 1
    One possible starting point for this sort of effort might be: `mtx <- do.call(rbind, sapply(statistics , "[[", "counts")[1:3])`. It is a bit annoying for this effort that the result is character, but that is how Frank handles the varying precision of the columns. – IRTFM Jun 19 '16 at 15:42
  • That's a great start, @42-. It still seems like it's going to take a bit of tedium to get it the rest of the way (eg, the recycling of the vector from the factor variable). I think my preference would still be to decide what I want & code it from scratch. – gung - Reinstate Monica Jun 19 '16 at 16:09
  • @gung - thank you so much for sharing such a descriptive email. This is really helpful. It has solved my purpose. – skumar Jun 19 '16 at 17:08
  • @42- thank you for giving a pointer to get the required output with a shorter approach using do.call and sapply functions, instead of following a longer approach. I think, we can treat numeric and factor variables separately as shown below to get required output: num_vars <- do.call(rbind, sapply(statistics , "[[", "counts")[1:4]) fact_var <- do.call(rbind, sapply(statistics , "[[", "counts")[5]) rbind.fill(as.data.frame(num_vars), as.data.frame(fact_var)) – skumar Jun 19 '16 at 17:15
  • A further refinement: Consider adding `print(as.data.frame(mtx))` – IRTFM Jun 19 '16 at 19:15
5

You can do this by using the stat.desc function from the pastecs package:

library(pastecs)
summary_df <- stat.desc(mydata) 

The summary_df is the dataframe you wanted. See more info here.

Vlad
2

In R, you just have to use the summary(iris) function instead of the describe(iris) function from Python.
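Note that summary(iris) returns a printed table object rather than a per-variable data frame. One possible way to get a data frame out of summary(), sketched here with base R only and restricted to the numeric columns as an assumption about what is wanted, is to summarise each column separately and bind the results:

```r
# Base R only; iris ships with R.
# Keep the numeric columns, since summary() of a factor has a different shape.
num_cols <- vapply(iris, is.numeric, logical(1))

# summary() of a numeric vector gives six named statistics;
# rbind stacks them into a matrix with one row per variable.
summary_df <- as.data.frame(do.call(rbind, lapply(iris[num_cols], summary)))
summary_df
```

This gives only the minimum, quartiles, mean and maximum, so it does not reproduce the n, missing, unique or 5%/95% columns from the question.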