1

In a grouped data frame, I want to generate a new column for the size (number of observations) of each group.

Whereas this is easy enough with dplyr functions, I struggle to find how to do it using tools from collapse package.

Let's take the mpg dataset from ggplot2. Once we group the data by manufacturer, we then want to compute an additional column that accounts for whether each row belongs to a group with of size > 10. With dplyr it's as easy as mutate(more_than_10_obs = n() > 10).

library(ggplot2) ## for the `mpg` dataset
library(dplyr, warn.conflicts = FALSE)   

mpg %>%
  select(manufacturer) %>%
  group_by(manufacturer) %>%
  mutate(more_than_10_obs = n() > 10)
#> # A tibble: 234 x 2
#> # Groups:   manufacturer [15]
#>    manufacturer more_than_10_obs
#>    <chr>        <lgl>           
#>  1 audi         TRUE            
#>  2 audi         TRUE            
#>  3 audi         TRUE            
#>  4 audi         TRUE            
#>  5 audi         TRUE            
#>  6 audi         TRUE            
#>  7 audi         TRUE            
#>  8 audi         TRUE            
#>  9 audi         TRUE            
#> 10 audi         TRUE            
#> # ... with 224 more rows

Created on 2021-08-24 by the reprex package (v2.0.0)

But I have very large data and I want to speed up processing time, therefore use tools from collapse package. So following this answer, I've tried to first generate a new column that simply gives the number of observations in the group that each row belongs to.

library(collapse)
library(ggplot2)
library(magrittr) ## for pipe operator

mpg %>%
  fselect(manufacturer) %>%
  fgroup_by(manufacturer) %>%
  ftransform(idx = seq_row(.)) %>%
  ftransform(n_in_group = fNobs(idx))
#> # A tibble: 234 x 3
#>    manufacturer   idx n_in_group
#>  * <chr>        <int>      <int>
#>  1 audi             1        234
#>  2 audi             2        234
#>  3 audi             3        234
#>  4 audi             4        234
#>  5 audi             5        234
#>  6 audi             6        234
#>  7 audi             7        234
#>  8 audi             8        234
#>  9 audi             9        234
#> 10 audi            10        234
#> # ... with 224 more rows
#> 
#> Grouped by:  manufacturer  [15 | 16 (11)]

Created on 2021-08-24 by the reprex package (v2.0.0)


My attempt is unsuccessful because column n_in_group gives the total number of rows in mpg, rather than the size of each group.

How can I utilize collapse tools to achieve the same output as in my dplyr-based code shown at the beginning?

Emman
  • 3,695
  • 2
  • 20
  • 44
  • I did not know `collapse`, it's a great package! Does this work for you: `mpg |> fgroup_by(manufacturer) |> GRP()`? – Vincent Guillemot Aug 24 '21 at 10:05
  • @VincentGuillemot, thanks. Your suggestion is in the right direction. However, it requires quite some wrangling to end up with the data format I need. Please note that had we done `mpg %>% fgroup_by(manufacturer) %>% ftransform(idx = seq_row(.)) %>% fsummarise(n_in_group = fNobs(idx))` we would have gotten pretty closer. Yet, I'm sure there's a more straightforward way to get the desired output directly. – Emman Aug 24 '21 at 10:09

1 Answers1

1

An equivalent method is:

library(collapse)
library(dplyr)

mpg %>%
  fselect(manufacturer) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)

# A tibble: 234 x 2
   manufacturer more_than_10_obs
 * <chr>        <lgl>           
 1 audi         TRUE            
 2 audi         TRUE            
 3 audi         TRUE            
 4 audi         TRUE            
 5 audi         TRUE            
 6 audi         TRUE            
 7 audi         TRUE            
 8 audi         TRUE            
 9 audi         TRUE            
10 audi         TRUE            
# ... with 224 more rows

Check that it gives the same result as dplyr:

collapse_res <- mpg %>%
  fselect(manufacturer) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)

dplyr_res <- mpg %>%
  select(manufacturer) %>%
  group_by(manufacturer) %>%
  mutate(more_than_10_obs  = n() > 10) %>%
  ungroup()

identical(dplyr_res, collapse_res)

[1] TRUE

Edit:

To group by multiple variables, simply pass them as a list:

mpg %>%
  fselect(manufacturer, class) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, list(manufacturer, class), TRA = "replace") > 10)
Ritchie Sacramento
  • 29,890
  • 4
  • 48
  • 56
  • This answer is great, thank you! I now realize that I made a problematic assumption. In my real data I group by more than one variable. Now I don't know how to adapt your solution to a situation where group by several variables. If not too much of a trouble, could you please supplement your answer with how you'd use `fnobs()` to get an output equivalent to `mpg %>% group_by(manufacturer, class) %>% mutate(more_than_10_obs = n() > 10)`? – Emman Aug 24 '21 at 11:27
  • Also, if I may ask -- the `x` argument in `fnobs()` is currently set to `manufacturer`. Does it actually matter which column we choose to specify as `x`? – Emman Aug 24 '21 at 11:58
  • Not if `x` contains no missing values. – Ritchie Sacramento Aug 24 '21 at 19:45