How to efficiently create a table of nrow over a series of data frames?

Question

UPDATE: Using different solutions found throughout the site, I still cannot achieve the desired output with the stack and ldply functions.

The desired output would look like this:

  Dataset              Samples
1     WGS        nrow(WGS.ped)
2     WES    nrow(WES.ped.exp)
3    MIPS   nrow(MIPS.ped.exp)

1) ldply: How to assign a name to columns V1 and .id?

ldply(list(WGS=WGS.ped, WES=WES.ped.exp, MIPS=mips.ped.exp), 
      function(l)(Samples=nrow(l)))

   .id    V1
1  WGS  3908
2  WES 26367
3 MIPS 14193

2) ldply: How to assign a name to columns V1 and .id?

ldply(list(WGS=WGS.ped, WES=WES.ped.exp, MIPS=mips.ped.exp), nrow)

   .id    V1
1  WGS  3908
2  WES 26367
3 MIPS 14193

3) lapply %>% as.data.frame : Returns the data frame names as columns, instead of as a first column 'Dataset'.

lapply(list(WGS=WGS.ped, WES=WES.ped.exp, MIPS=mips.ped.exp), nrow) %>% 
  as.data.frame

   WGS   WES  MIPS
1 3908 26367 14193

4) sapply %>% stack : How to reverse the order of the columns? And how to indicate column names with stack?

sapply(list(WGS=WGS.ped, WES=WES.ped.exp, MIPS=mips.ped.exp), nrow) %>% 
  stack()

  values  ind
1   3908  WGS
2  26367  WES
3  14193 MIPS

5) map %>% as.data.frame : Returns the data frame names as columns, instead of as a first column 'Dataset'.

map(list(WGS=WGS.ped, WES=WES.ped.exp, MIPS=mips.ped.exp), nrow) %>% 
  as.data.frame()

 WGS   WES  MIPS 
 3908 26367 14193

I have three data frames: WGS.ped, WES.ped.exp and MIPS.ped.exp.

I want to create a new data frame that summarizes their row count / the total number of rows in each data frame.

The desired output would look like this:

Dataset Samples
WGS     nrow(WGS.ped)
WES     nrow(WES.ped.exp)
MIPS    nrow(MIPS.ped.exp)

What is an efficient and reproducible way to achieve this, preferably with dplyr?

Thanks!

If you actually have many data frames, then check out my answer in [this post](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames). – lmo, Apr 21 '18 at 22:09
The `sapply` solution *almost* works with `stack`, but I can't quite format it to the desired output. I just updated with an example. – Carmen Sandoval, Apr 21 '18 at 23:13
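
For the `sapply` plus `stack` attempt, the column order and names can be adjusted after the fact. A minimal sketch, assuming built-in data sets as stand-ins for the three data frames in the question (the object names `dfs`, `counts` and `out` are illustrative only):

```r
# Built-in data sets stand in for WGS.ped, WES.ped.exp and mips.ped.exp
dfs <- list(WGS = mtcars, WES = iris, MIPS = PlantGrowth)

counts <- sapply(dfs, nrow)                 # named integer vector of row counts
out <- stack(counts)[, c("ind", "values")]  # reorder so the names column comes first
names(out) <- c("Dataset", "Samples")       # rename the columns stack() produced
out$Dataset <- as.character(out$Dataset)    # stack() returns the names as a factor
out
```

Selecting the columns by name rather than by position keeps the reordering step readable, and the rename is a plain assignment to `names()`, so no extra packages are needed.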

Marcus Campbell · Accepted Answer · 2018-04-22T02:17:51.400

6

Okay, this was especially fun to figure out. Here's a revised solution that only requires dplyr. It takes advantage of the base function mget, which builds us a named list of our dataframes by grabbing them from our R environment after we pass it a vector of names to look for.

Following that, it's just a matter of using .id in bind_rows() to create a "dummy" column of the dataframe names, which lets us neatly group and summarise.

library(dplyr)

# Load some built-in dataframes to use as an example
df1 <- mtcars
df2 <- iris
df3 <- PlantGrowth

names_list <- c("df1","df2","df3")
summary_df <- mget(names_list, envir = globalenv()) %>%
              bind_rows(.id = "Dataset") %>%
              group_by(Dataset) %>%
              summarise(Samples = n())

# Output
# A tibble: 3 x 2
  Dataset Samples
  <chr>     <int>
1 df1          32
2 df2         150
3 df3          30
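
Binding the data frames stacks every row just to count them, and `bind_rows()` may fail when same-named columns have incompatible types. If only the counts are needed, the same `mget` idea can feed `nrow` directly. The following is a sketch in base R, not part of the answer above; `count_rows` is a hypothetical helper name, and the default `envir = parent.frame()` is an assumption that the data frames live where the function is called from:

```r
# Hypothetical helper: count the rows of the data frames named in `df_names`
count_rows <- function(df_names, envir = parent.frame()) {
  dfs <- mget(df_names, envir = envir)  # named list of data frames
  data.frame(
    Dataset = names(dfs),
    Samples = unname(vapply(dfs, nrow, integer(1))),  # one integer per data frame
    stringsAsFactors = FALSE
  )
}

df1 <- mtcars
df2 <- iris
df3 <- PlantGrowth
count_rows(c("df1", "df2", "df3"))
```

Because the names are passed as a character vector, the same call can be repeated later in a script with a different set of data frames.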

edited Apr 22 '18 at 02:17

answered Apr 21 '18 at 22:36

Marcus Campbell


Thanks! I didn't know about `map`, but how is it different from `lapply`? I wish there were a way to create the final table more efficiently after `map` / `lapply`, i.e., without having to manually specify the names column and the values column. – Carmen Sandoval Apr 21 '18 at 23:18
What do you mean by "efficient"? Do you mean using less code? Or do you mean using less computer resources? Personally, I think it's usually a good idea to use the method which is the most *readable*. – Marcus Campbell Apr 21 '18 at 23:20
I agree, but I'd like to avoid the manual specification of the columns for the final table. Later in the script I'd like to replicate the table using a new set of data frames. – Carmen Sandoval Apr 21 '18 at 23:22
Hmm okay. That's an interesting request. Let me think about it for a bit. – Marcus Campbell Apr 21 '18 at 23:26
Thanks for the update! I'm following you up until `group_by(Dataset)` -- After running `summarise(nrows = n())`, I get: `Error: This function should not be called directly`. Where are you specifying the name of the summary column as `Samples`? – Carmen Sandoval Apr 22 '18 at 02:08
Never mind, looks like it's a conflict in my session with plyr's `summarise`. Trying it out again... – Carmen Sandoval Apr 22 '18 at 02:10
This is such an elegant solution! FYI, I tried it without specifying the environment to `mget` and it works as well. – Carmen Sandoval Apr 22 '18 at 02:12
I learned some new things in the process of getting it to this point, so thank you for pushing me to give you a more elegant solution. – Marcus Campbell Apr 22 '18 at 02:14
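
The plyr conflict mentioned in the comments typically arises when plyr is attached after dplyr, so that an unqualified `summarise` resolves to plyr's version. One way to sidestep it, sketched here under the assumption that dplyr is installed, is to qualify the calls with their namespace:

```r
library(dplyr)

# Namespace-qualified verbs resolve to dplyr even if plyr is attached later
summary_df <- list(df1 = mtcars, df2 = iris, df3 = PlantGrowth) %>%
  dplyr::bind_rows(.id = "Dataset") %>%
  dplyr::group_by(Dataset) %>%
  dplyr::summarise(Samples = dplyr::n())

summary_df
```

Loading plyr before dplyr, or not attaching plyr at all, should avoid the masking as well.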

score 0 · Answer 2 · answered Apr 22 '18 at 01:17

Here's a base R function that will summarize the data frames you pass to it:

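summarize_data <- function(...) {
  data <- list(...)

  # Recover the call as strings; the first element is the function name
  call <- as.character(match.call())

  # If an argument is itself a call such as f(x), keep only what is inside
  # the parentheses, then drop the function name
  names <- gsub(".*\\((.*)\\).*", "\\1", call)[-1]

  # One row per data frame: its name and its row count
  data.frame(names = names,
             rows = sapply(data, nrow),
             stringsAsFactors = FALSE)
}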

This gives:

> summarize_data(mtcars, iris)
   names rows
1 mtcars   32
2   iris  150
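
To get the `Dataset` and `Samples` column names from the question, the argument names can also be captured with `substitute` instead of a regular expression. This is a sketch of an alternative, not the answer author's code; `summarize_data2` and `arg_names` are hypothetical names, and it assumes each argument is a plain object name:

```r
# Hypothetical variant that labels the columns as in the question
summarize_data2 <- function(...) {
  # Deparse each unevaluated argument to recover the name it was passed under
  exprs <- as.list(substitute(list(...)))[-1]
  arg_names <- unname(vapply(exprs, deparse, character(1)))

  data.frame(Dataset = arg_names,
             Samples = unname(vapply(list(...), nrow, integer(1))),
             stringsAsFactors = FALSE)
}

summarize_data2(mtcars, iris)
```

Using `vapply` with an explicit return type makes the function fail loudly if something other than a data frame (or other object with rows) is passed in.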
