how to subset dataframe based on vector and then group by in dplyr

Question

I have following dataframe in R

 ID        Blocks
 123       2D
 345       2E
 324       2D
 567       4F
 678       5E
 444       2D
 145       2D

Now I want to subset dataframe based on a vector and then on each subset I want to apply group by

e.g vec <- c(2,3,2) Now First it should subset the dataframe with first two rows

 df <- df[1:2,]
 df %>%
 group_by(Blocks) %>%
 summarise(count = n())

Then it should subset the dataframe with next 3 rows

 df <- df[3:5,]
 df %>%
 group_by(Blocks) %>%
 summarise(count = n())

and so on. Desired output will be,

 Subset    2D    2E     4F     5E
  2         1     1     0      0
  3         1     0     1      1
  2         2     0     0      0

Individually I can do with above code,but my dataframe is huge and I have to subset it many times. How can we do it more effectively in R?

akrun · Accepted Answer · 2018-03-14T14:13:54.123

6

We can use rep to create a grouping variable for splitting the dataset into a list of 'data.frame', then with map, get the count

library(tidyverse)
split(df, rep(seq_along(vec), vec)) %>% 
              map(~ .x %>%
                      count(Blocks))

Or another option is to create a column of 'grp' based on 'vec' (to be more aligned with the expected output which is 'wide' dataset., along with 'Subset' that is actual values of 'vec', summarise to get the frequency and spread the data to 'wide' format

df %>%
   filter(row_number() <= sum(vec)) %>% #based on the comments from OP
   mutate(Subset = rep(vec, vec)) %>%
   group_by(grp = rep(seq_along(vec), vec), Blocks) %>%
   summarise(Subset = first(Subset), n = n())%>%
   ungroup %>%
   spread(Blocks, n, fill = 0) %>%
   select(-grp)
# A tibble: 3 x 5
#   Subset  `2D`  `2E`  `4F`  `5E`
#*  <dbl> <dbl> <dbl> <dbl> <dbl>
#1   2.00  1.00  1.00  0     0   
#2   3.00  1.00  0     1.00  1.00
#3   2.00  2.00  0     0     0

edited Mar 14 '18 at 14:13

answered Mar 14 '18 at 13:44

akrun

874,273
37
540
662

Thanks for your answer. Now my dataframe has `2086` observations and my `vec <- c(32,78,78)` when I run your code it gives me an error. `Error in mutate_impl(.data, dots) Column Subset` must be length 2086 (the number of rows) or one, not 188` – Neil Mar 14 '18 at 14:00
@Neil Here, the vector you showed is having a sum of 7 and the number of rows is 7. I don't understand the logic for replicating '2086' via 'vec'. In the comment, the sum for `vec` is 188. Could you specify the rules. It seems very vague – akrun Mar 14 '18 at 14:02
Idea is to subset the rows based on vector specified. e.g I have 100 rows and my `vec <- c(10,10,10)` then only for this rows it will calculate group by. for 1st 10 rows then 11 to 20 and 21:30 likewise. – Neil Mar 14 '18 at 14:06
What about after 21:30? Let's take your first comment i.e. `vec <- c(32, 78, 78)` What will be the grouping for 2086? – akrun Mar 14 '18 at 14:07
After 21:30 it will not run. – Neil Mar 14 '18 at 14:11
@Neil In that case, first do a subset of the dataset i.e. `df %>% filter(row_number() <= sum(vec)) %>% mutate(..` – akrun Mar 14 '18 at 14:13
@Neil I updated the post. Please check if that is what you meant – akrun Mar 14 '18 at 14:15
1

Perfect, thats what I needed. Thanks a ton :) – Neil Mar 14 '18 at 14:27
1

@akrun I found a quite simple solution based on one of your numerous genial answer :) – denis Mar 14 '18 at 15:25

denis · Answer 2 · 2018-03-14T15:23:34.747

I have an other answer using an other great idea of akrun here r - how to add columns dynamically based on numerical values sequences from another column. You then just need to sum over each group created by rep(seq(vec), vec), and the result is quite simple:

library(dplyr)
df2 <- as.data.frame.matrix(table(seq(nrow(df)),df$Blocks))
df2 %>%
  group_by(grp = rep(seq(vec), vec)) %>%
  summarise_all(sum)%>%
  mutate(grp=vec)

# A tibble: 3 x 5
    grp  `2D`  `2E`  `4F`  `5E`
  <dbl> <int> <int> <int> <int>
1     2     1     1     0     0
2     3     1     0     1     1
3     2     2     0     0     0

how to subset dataframe based on vector and then group by in dplyr

2 Answers2