From list to data frame with tidyverse, selecting specific list elements

Question

A simple question, but I've searched for a solution and so far found nothing.

Say that I have a list object, and I want to pull specific list elements and output them side by side as data frame columns. How can I achieve this with tidyverse/piping in a simple way? My attempt is below.

Data

some_data <-
structure(list(x = c(23.7, 23.41, 23.87, 24.18, 24.15, 24.31, 
23.14, 23.72, 24.12, 23.47, 23.59, 23.29, 23.24, 23.5, 23.56, 
23.16, 23.62, 23.67, 23.84, 23.69, 23.7, 23.68, 24.2, 23.77, 
23.74, 23.64, 24.39, 24.05, 24.51, 23.6, 24.29, 23.31, 23.96, 
24.07, 24.37, 23.77, 23.64, 24, 23.68, 24.02, 23.36, 23.54, 23.34, 
23.69, 23.79, 23.8, 23.7, 24.45, 23.27, 23.57, 23.02, 24.23, 
23.41, 23.6, 24.02, 23.94, 24.06, 23.97, 23.38, 23.46, 24, 23.89, 
23.51, 23.72, 23.83, 23.96, 23.84, 23.52, 24.36, 23.94, 23.82, 
24.04, 24.05, 23.6, 23.52, 24.13, 23.43, 23.33, 24.01, 23.99, 
24.46, 24.23, 24.19, 23.83, 23.8, 23.93, 23.79, 23.48, 23.26, 
24.04, 23.93, 23.98, 23.86, 23.49, 24.17, 23.7, 23.54, 23.55, 
23.67, 23.66)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -100L), spec = structure(list(cols = list(
    x = structure(list(), class = c("collector_double", "collector"
    ))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

I want the values returned by the `hist()` function for this data:

library(tidyverse)

some_data$x %>% 
   as.numeric() %>% 
   hist(breaks = seq(from = 23, to = 24.6, by = 0.2),
        plot = FALSE)

## $breaks
## [1] 23.0 23.2 23.4 23.6 23.8 24.0 24.2 24.4 24.6

## $counts
## [1]  3  9 20 23 19 16  7  3

## $density
## [1] 0.15 0.45 1.00 1.15 0.95 0.80 0.35 0.15

## $mids
## [1] 23.1 23.3 23.5 23.7 23.9 24.1 24.3 24.5

## $xname
## [1] "."

## $equidist
## [1] TRUE

## attr(,"class")
## [1] "histogram"

So let's say that I want both `$breaks` and `$counts` side by side as a data frame.

I extend the original pipe as follows:

some_data$x %>% 
   as.numeric() %>% 
   hist(breaks = seq(from = 23, to = 24.6, by = 0.2),
        plot = FALSE) %>%
##
   map_df(~.[1:30]) %>%
   select(bins = breaks, 
          frequency = counts)
##

## # A tibble: 30 x 2
##     bins frequency
##    <dbl>     <int>
##  1  23           3
##  2  23.2         9
##  3  23.4        20
##  4  23.6        23
##  5  23.8        19
##  6  24          16
##  7  24.2         7
##  8  24.4         3
##  9  24.6        NA
## 10  NA          NA
## # ... with 20 more rows

So yes, it does work, but in map_df() I had to put a relatively large "magic" number (arbitrarily I put 30) to ensure all data is included. Is there a simpler way to get $breaks and $counts as a dataframe? Maybe even with just one step instead of combining map_df() and then select()?

COMMENT

While this specific problem uses an object of class histogram, my general question isn't about histograms but about list objects in principle. The nice thing about the output of hist(plot = FALSE) is that it generates an object with unequal-length elements, which demonstrates a problem that needs a flexible solution to account for the variation in element length.

SOLUTION

Based on Rémi Coulaud's (chosen) solution below, the way to handle list elements of unequal length is to make them equal by padding every element to the length of the longest one. After that, the conversion is no longer a problem. The working pipe is as follows:

library(tidyverse)

some_data$x %>% 
  as.numeric() %>% 
  hist(breaks = seq(from = 23, to = 24.6, by = 0.2),
       plot = FALSE) %>%
  lapply(., `length<-`, max(lengths(.))) %>%  ## make all elements as the length of the longest one
  map_df(~.) %>%
  select(bins = breaks, 
         frequency = counts)
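
For the more general case of "any subset of list elements", the same padding idea can be wrapped in a small function. This is only a sketch: `pad_to_df()` and its arguments are hypothetical names that do not appear in the thread, and it assumes the selected elements are plain atomic vectors. It selects first and pads afterwards, so the padding length comes from the longest *selected* element rather than the longest element of the whole list.

```r
# Hypothetical helper (not from the thread): pick elements by name,
# pad each one with NA to the length of the longest picked element,
# and return the result as a data frame.
pad_to_df <- function(lst, elements = names(lst)) {
  picked <- lst[elements]                   # plain list subset; drops the class
  n <- max(lengths(picked))                 # longest selected element
  padded <- lapply(picked, `length<-`, n)   # shorter vectors get trailing NA
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Small illustration with a histogram object
h <- hist(c(23.1, 23.3, 23.5, 23.5, 23.9),
          breaks = seq(23, 24, by = 0.2),
          plot = FALSE)
res <- pad_to_df(h, c("breaks", "counts"))
res
```

Because the result is an ordinary data frame, it can be piped into `select()` or `rename()` afterwards if different column names are wanted.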

Thanks!

score 3 · Answer 1 · answered Dec 15 '19 at 12:28

We can use imap and enframe to convert each element of the list to a data frame with a `name` column (the row number) and a value column named after the list element. We can then use reduce and full_join to join all the data frames on `name`. Finally, we can select the columns we want. This approach does not need a "magic" number.

library(tidyverse)

some_data$x %>% 
  as.numeric() %>% 
  hist(breaks = seq(from = 23, to = 24.6, by = 0.2),
       plot = FALSE) %>%
  imap(~enframe(.x, value = .y)) %>%
  reduce(full_join, by = "name") %>%
  select(bins = breaks, 
         frequency = counts)
# # A tibble: 9 x 2
#   bins frequency
#   <dbl>     <int>
# 1  23           3
# 2  23.2         9
# 3  23.4        20
# 4  23.6        23
# 5  23.8        19
# 6  24          16
# 7  24.2         7
# 8  24.4         3
# 9  24.6        NA
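
To make the intermediate step visible, here is a minimal sketch of what `enframe()` and `full_join()` do for two elements of different length. The short vectors are made-up stand-ins for `$breaks` and `$counts`, not the real data.

```r
library(tibble)
library(dplyr)

# enframe() turns a vector into a two-column tibble; for an unnamed vector
# the `name` column is just the position 1, 2, 3, ...
breaks_df <- enframe(c(23, 23.2, 23.4), value = "breaks")
counts_df <- enframe(c(3L, 9L), value = "counts")

# Joining on `name` lines the elements up by position; positions that exist
# in only one table are filled with NA in the other column.
joined <- full_join(breaks_df, counts_df, by = "name")
joined
```

The join on position is what removes the need to know the longest length in advance.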

Thank you. While the result is what I'm looking for, I'm hoping for a simpler method. In attempting to avoid magic numbers, is there a way to reference the length of the longest list element while still inside the pipe? An example (which isn't working) would be: `some_data$x %>% as.numeric() %>% hist(breaks = seq(from = 23, to = 24.6, by = 0.2), plot = FALSE) %>% map_df(~.[1:max(lengths(.))])` Can we tweak this `1:max(lengths(.))` into something that does work? — Emman, Dec 15 '19 at 14:42
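
A likely reason the attempt in this comment fails: inside a `~` lambda, `.` refers to the element currently being processed, not to the whole list, so `lengths(.)` is evaluated on a single vector and its maximum is 1. A minimal sketch of one workaround, assuming a made-up list `h` in place of the histogram object, is to compute the maximum length on the whole list before iterating:

```r
# Made-up stand-in for the histogram object
h <- list(breaks = c(1, 2, 3), counts = c(10L, 20L))

# lengths() applied to one element (an atomic vector) returns 1 for each
# value, which is what the lambda in the comment effectively computes
max(lengths(h$breaks))

# Compute the target length on the whole list, outside the per-element function
n <- max(lengths(h))

# Indexing past the end of a vector yields NA, so this pads each element to n
padded <- lapply(h, function(el) el[seq_len(n)])
as.data.frame(padded)
```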

Cole · Answer 2 · 2019-12-15T15:10:32.077

Part of the complicating factor is that the elements of a hist() object have different lengths:

library(tidyverse)

brks <- seq(from = 23, to = 24.6, by = 0.2)

hist_res <- some_data$x %>% 
  as.numeric() %>% 
  hist(breaks = brks,
       plot = FALSE)

lengths(hist_res)

  breaks   counts  density     mids    xname equidist 
       9        8        8        8        1        1

OP commented that unequal element lengths are a main part of the question. We need a choice or rule to determine which list elements are selected for a data.frame. In this case, we can select the elements with the most frequent length using a combination of table(), which(), and base [. For this hist() example, I still add the breaks manually as start and end columns in a mutate call:

l <- lengths(hist_res)
cols <- which(l == as.integer(names(table(l)))[which.max(table(l))])

hist_res %>%
  .[cols] %>%
  as_tibble() %>%
  mutate(brk_start = brks[-length(brks)],
         brk_end = brks[-1])

# A tibble: 8 x 5
  counts density  mids brk_start brk_end
   <int>   <dbl> <dbl>     <dbl>   <dbl>
1      3   0.15   23.1      23      23.2
2      9   0.45   23.3      23.2    23.4
3     20   1.000  23.5      23.4    23.6
4     23   1.15   23.7      23.6    23.8
5     19   0.95   23.9      23.8    24  
6     16   0.8    24.1      24      24.2
7      7   0.35   24.3      24.2    24.4
8      3   0.150  24.5      24.4    24.6
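
The "most frequent length" rule can also be written as a small reusable function. This is a sketch only: `keep_modal_length()` is a hypothetical name, and the list below merely mimics the element lengths of a hist() object. When two lengths are equally frequent, `which.max()` picks the first entry of the table, which is the smaller length.

```r
# Hypothetical helper: keep only the elements whose length is the most
# frequent one in the list (ties resolve to the smaller length).
keep_modal_length <- function(lst) {
  l <- lengths(lst)
  tab <- table(l)
  modal <- as.integer(names(tab)[which.max(tab)])
  lst[l == modal]
}

# Made-up list with the same element lengths as a hist() result
fake_hist <- list(breaks = 1:9, counts = 1:8, density = 1:8, mids = 1:8,
                  xname = ".", equidist = TRUE)
kept <- keep_modal_length(fake_hist)
names(kept)
```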

Thanks. You're right, the complexity is with unequal lengths of list elements. Your solution is clever, but it narrows down to the 2 elements I specified in my question. I'm looking for a more broad solution that accounts for _any_ subset of list elements. — Emman, Dec 15 '19 at 14:45
See edit. This is complicated: it's hard to programmatically handle lists of different lengths without a rule. I also go outside of the pipe, but there are likely ways to make the pipe work. — Cole, Dec 15 '19 at 15:13

Rémi Coulaud · Accepted Answer · 2019-12-15T15:33:09.853

The best answer I found for the histogram part of the question is here.

I was trying to do the same thing. In fact, you don't need the hist function at all, because in the end you want a data.frame.

One solution is:

library(tidyverse)
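breaks <- seq(from = 23, to = 24.6, by = 0.2)
# Bin each value, then count per bin. findInterval() is left-closed by default,
# while hist() is right-closed, so values exactly on a break may land differently.
counts <- some_data$x %>%
  as.numeric() %>%
  findInterval(vec = breaks) %>%
  tabulate()
# One NA pads the counts to the length of breaks
df <- data.frame(breaks = breaks, frequency = c(counts, NA))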

df

The NA is needed because there is one fewer count than there are break values.
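
A tiny worked example may make the binning step easier to follow. The numbers are invented for illustration. The `nbins` argument is an addition not used in the answer: without it, `tabulate()` stops at the highest bin that actually occurs, so empty bins at the top would be dropped silently.

```r
breaks <- c(0, 1, 2, 3)
x <- c(0.5, 1.5, 1.7, 2.2)

# Index of the interval each value falls into
bin <- findInterval(x, vec = breaks)

# Count values per bin; there are length(breaks) - 1 bins
counts <- tabulate(bin, nbins = length(breaks) - 1)

# Four breaks but only three counts, hence the trailing NA
df_small <- data.frame(breaks = breaks, frequency = c(counts, NA))
df_small
```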

EDIT 1

The specifics of the hist class must be taken into account, as @Cole says. If you want a solution for a general list object, look at the answer below.

If your question is only about going from a list to a data.frame, it may be more appropriate to choose an example with just a plain list. Without the complication of converting a hist object, there is little left to solve: a data.frame in R is itself a list whose elements all have the same length. So for a list with equal-length elements you can just do:

library(dplyr)
l <- list(breaks = c(1, 2, 3, 4),
          counts = c(10, 34, 54, 78),
          other = rep("A", 4))

If tibble is needed:

l %>% as_tibble %>% select(breaks:counts)

If you want a data.frame:

l %>% data.frame

I hope this clarifies your question a bit.

EDIT 2

For a list with elements of unequal length, see there. `lengths()` gives you the length of each element of the list. All elements can then be brought to the same length with:

lapply(l, `length<-`, max(lengths(l)))

After that, you just have to bind them and transform the result to a data.frame. You can use dplyr syntax with a pipe, but it also works like this:

as.data.frame(do.call(cbind, lapply(l, `length<-`, max(lengths(l)))))

With pipe:

lapply(l, `length<-`, max(lengths(l))) %>%
  do.call(what = cbind) %>%
  data.frame

In conclusion, it seems necessary to determine the maximum length first and only then create the data.frame.
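
One caveat with the `cbind` route, added here as a side note: `cbind()` builds a matrix, and a matrix holds a single type, so a list that mixes numeric and character elements ends up with every column converted away from numeric. A sketch that keeps the column types passes the padded list straight to `as.data.frame()` instead. The list is the four-element example used in this answer.

```r
l <- list(breaks = c(1, 2, 3, 4),
          counts = c(10, 34, 54, 78),
          other = rep("A", 4),
          diff = rep("B", 3))

padded <- lapply(l, `length<-`, max(lengths(l)))

# Via cbind: everything goes through a character matrix first
via_cbind <- as.data.frame(do.call(cbind, padded))

# Directly from the list: each column keeps its own type
direct <- as.data.frame(padded, stringsAsFactors = FALSE)

str(via_cbind)
str(direct)
```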

The `length<-` function (see there) gives you all elements from the beginning up to the length that you give, 5 in my example. If your vector is shorter, it automatically pads it with NA values.

For instance:

l <- list(breaks = c(1, 2, 3, 4),
          counts = c(10, 34, 54, 78),
          other = rep("A", 4),
          diff = rep("B", 3))

`length<-`(l$breaks, 5)
[1]  1  2  3  4 NA
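
As a small additional illustration with made-up vectors, `length<-` works in both directions: a larger length pads with NA, a smaller one truncates.

```r
# Padding: the vector is extended and the new positions are NA
longer <- `length<-`(c("a", "b"), 3)

# Truncating: only the first two values are kept
shorter <- `length<-`(1:4, 2)

longer
shorter
```

This is why padding to `max(lengths(l))` never loses data, whereas padding to a smaller number would silently cut the longer elements.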

Thanks. You're right, histogram-wise, there are probably many other ways to get the same result. However, I was using this case example with `hist()` just to illustrate my general question about extracting list elements and arranging them in a data frame, while accounting for elements of unequal length. This is why the question isn't framed as a histogram problem. — Emman, Dec 15 '19 at 14:23
Thanks! I basically took only the ``lapply(l, `length<-`, max(lengths(l)))`` line you proposed and implemented it in my pipe, and it solved everything. I'll edit the post to reflect the mechanism of solution. — Emman, Dec 15 '19 at 15:38
I'm glad to be able to help, but it is mainly thanks to akrun's answer. — Rémi Coulaud, Dec 15 '19 at 15:40