0

I have a quite simple R code, which seems not to be working as expected.

library(RColorBrewer)
library(ggplot2)
library(data.table)
library(tidyverse)

file <- fread(file)

file$V1 <- NULL

data <- file %>%
  select(col_a, col_b, col_c) %>%
  group_by(col_a, col_b) %>% 
  mutate(sum = sum(col_c)) 

Some data that could be relevant:

Classes ‘data.table’ and 'data.frame':    148195407 obs. of  3 variables:
 $ col_a      : int  85888 11111 99999 85888 11111 99999 85888 11111 99999 85888 ...
 $ col_b    : chr  "aaa" "bbb" "ccc" "aaa" ...
 $ col_c: int  4352 4352 52896 4352 48960 48960 48960 4352 27539 4352 ...

PROBLEM: It is not grouping the data according to col_a and col_b

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
CroatiaHR
  • 615
  • 6
  • 24
  • Could be that you have loaded `plyr` along with `dplyr` resulting in masking of `mutate` from dplyr use `dplyr::mutate(sum = sum(col_c))` – akrun May 09 '20 at 22:03
  • 1
    Also, have you tried `dat.atable` methods as it is a big dataset i.e `file[, Sum := sum(col_c), .(col_a, col_b)]` – akrun May 09 '20 at 22:04
  • the second comment sure solved the problem. I didn't know that datatable is more suitable for big datasets. – CroatiaHR May 09 '20 at 22:10
  • Not sure why the `tidyverse` is not working for you. I was thinking that it is due to masking of function. – akrun May 09 '20 at 22:12
  • could you please explain to me the function call you are using? what are the parameters meaning – CroatiaHR May 09 '20 at 22:14
  • Updated the post with some description. thanks – akrun May 09 '20 at 22:18

1 Answers1

0

As the OP's dataset is already a data.table (fread output is a data.table), and as it is a large dataset, it may be more efficient to use data.table methods

library(data.table)
file[, Sum := sum(col_c), by = .(col_a, col_b)]

Here, we group by 'col_a', 'col_b', and get the sum of 'col_c' in j and assign (:=) the output to create the 'Sum' column

If we need only a summary output, then instead of assignment,

file[, .(Sum = sum(col_c)), by = .(col_a, col_b)]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • is it possible to extend this call to two functions? instead of getting only one extra column with sum, also another one with max? – CroatiaHR May 14 '20 at 00:33
  • @CroatiaHR. You can do `.(Sum = sum(col_c), Max = max(col_c)),` – akrun May 14 '20 at 18:42