1

I have data frame like this dummy sample, my real dataset had 56 variables. I would like to drop the date and aggregate by id and sum last 4 total variables while keep the other unchanged.

df <- data.frame(stringsAsFactors=FALSE,
          date = c("2019-02-10", "2019-02-10", "2019-02-11", "2019-02-11",
                   "2019-02-12", "2019-02-12", "2019-02-13", "2019-02-13",
                   "2019-02-14", "2019-02-14"),
            id = c("18100410-aa", "18101080-ae", "18100410-aa", "18101080-ae",
                   "18100410-aa", "18101080-ae", "18100410-aa", "18101080-ae",
                   "18100410-aa", "18101080-ae"),
        f_type = c(4L, 2L, 4L, 2L, 4L, 2L, 4L, 2L, 4L, 2L),
           reg = c(6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L),
        hh_p10 = c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L),
      internet = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
      youngest = c(5L, 7L, 5L, 7L, 5L, 7L, 5L, 7L, 5L, 7L),
       a_group = c(3L, 6L, 3L, 6L, 3L, 6L, 3L, 6L, 3L, 6L),
     total_prd = c(130L, 337L, 374L, 261L, 106L, 230L, 150L, 36L, 15L, 123L),
   B_totalprod = c(20L, 0L, 256L, 0L, 32L, 0L, 0L, 36L, 0L, 45L),
   p_totalprod = c(0L, 81L, 11L, 260L, 26L, 230L, 0L, 0L, 15L, 0L),
   n_totalprod = c(110L, 256L, 107L, 1L, 48L, 0L, 150L, 0L, 0L, 78L)
)

I found this solution from plyr package here it is working but I need to specify all my 52 unaffected variables. I am just wondering is there any other way to do this task?

library(plyr)
ddply(df,.(id,f_type, reg, internet,hh_p10 ,youngest, a_group ),summarise,total_prd = sum(total_prd) ,
      B_totalprod = sum(B_totalprod) , p_totalprod = sum(p_totalprod) ,
      n_totalprod = sum(n_totalprod))
DanG
  • 689
  • 1
  • 16
  • 39
  • You can use the dplyr `mutate` function to create newly derived columns then the `select` function to delete the columns you no longer need. The swirl tutorial package has a module "Getting and Cleaning Data" that explains the functions for you as well as the piping construct. – SteveM Feb 18 '19 at 20:37

1 Answers1

2

If your real dataset also has columns that contain "total" this should work:

library(tidyverse)
df %>%
  select(-date) %>%
  group_by(.dots = str_subset(names(.), "total", negate = TRUE)) %>%
  summarise_all(list(sum = sum))

# A tibble: 2 x 11
# Groups:   id, f_type, reg, hh_p10, internet, youngest [2]
  id          f_type   reg hh_p10 internet youngest a_group total_prd_sum B_totalprod_sum p_totalprod_sum n_totalprod_sum
  <chr>        <int> <int>  <int>    <int>    <int>   <int>         <int>           <int>           <int>           <int>
1 18100410-aa      4     6      2        1        5       3           775             308              52             415
2 18101080-ae      2     7      1        2        7       6           987              81             571             335

The line group_by(.dots = str_subset(names(.), "total", negate = TRUE)) means we are going to group by all the column names in our this dataset that do not contain the word "total".

JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • Thanks, but when I use your code I got this error **Error in str_subset(names(.), "total", negate = TRUE) : unused argument (negate = TRUE)** – DanG Feb 18 '19 at 21:04
  • @DanielG try install the latest version of `stringr` via `install.packages("stringr")` - you may have to restart R. – JasonAizkalns Feb 18 '19 at 21:10
  • Yes, I used only `library(stringr)` and apparently didnot have the latest version. Its all good now! Thanks @JasonAizkalns – DanG Feb 18 '19 at 21:16