1

I am trying to apply different functions to different columns for a grouped data, but I am unsure why one of the functions isn't working. I did research this topic on R - Group data but apply different functions to different columns this thread, but I believe they are not using lapply in that example. In my example, I want to concatenate all ID and apply sum to all numeric columns.

Here's my Input data:

dput(Input)
structure(list(ID = c(1, 2, 3, 4, 5, 6), Name = c("Boston", "Boston", 
"Boston", "Boston", "Seattle", "Washington"), Total_Groceries = c(35, 
70, 95, 120, 10, 20), Apple = c(5, 10, 15, 20, 2, 10), Banana = c(10, 
20, 30, 40, 5, 8), Pineapple = c(20, 40, 50, 60, 3, 2)), .Names = c("ID", 
"Name", "Total_Groceries", "Apple", "Banana", "Pineapple"), row.names = c(NA, 
6L), class = "data.frame")

Here's my code:

 Input<-data.table::as.data.table(Input)

  Input[,as.list(ID=paste0(ID,collapse=";"),unlist(lapply(.SD, sum, na.rm=TRUE))),by=.(Name),.SDcols=c("Total_Groceries","Apple","Banana","Pineapple")]

Running above code, you would see that I am not getting concatenated string for ID

Here's expected output:

dput(Output)
structure(list(ID = c("1;2;3;4", "5", "6"), Name = c("Boston", 
"Seattle", "Washington"), Total_Groceries = c(320, 10, 20), Apple = c(50, 
2, 10), Banana = c(100, 5, 8), Pineapple = c(170, 3, 2)), .Names = c("ID", 
"Name", "Total_Groceries", "Apple", "Banana", "Pineapple"), row.names = c(NA, 
3L), class = "data.frame")

As a bonus (and for my learning, it would really help if you could post an answer based on dplyr. I am primarily looking for data.table based answer because of the sheer size of original data.

I'd appreciate any thoughts.

Community
  • 1
  • 1
watchtower
  • 4,140
  • 14
  • 50
  • 92
  • 1
    What about this `df[, c(list(ID = paste(ID, collapse = ';')), lapply(.SD, sum)), by =Name, .SDcols = names(df)[-(1:2)]]` – mt1022 May 07 '17 at 01:37
  • @mt1022 - That worked beautifully! Thank you. Do you think you could help me with `dplyr` based code? – watchtower May 07 '17 at 01:42
  • In data.table, it should be less slow if you handle the pasting in a separate operation. See `?GForce`. In dplyr, you probably need to do it that way (mutate the pasted column, then group on it, then summarise_each / summarise_at the other columns). – Frank May 07 '17 at 01:46
  • @Frank, I modified the above code to `scols <- names(df)[-(1:2)]; df[, c(list(paste(ID, collapse = ';')), lapply(.SD, sum)), by =Name, .SDcols = scols, verbose = T]`; It seems GForce is on. But I don't kown hwo to assign a new column name for new ID column, it is `V1` now. Could you show me how to assign new colname with `by` that will still keep GForce on? – mt1022 May 07 '17 at 04:28
  • 1
    @mt1022 `df[, id2 := paste(ID, collapse=";"), by=Name][, c(.(ID = first(id2)), lapply(.SD, sum)), by =Name, .SDcols = sapply(df, is.numeric), verbose = T]` you can see it's "on" in the sense of doing something thanks to the "GForce optimised j to..." line. I'm not sure if this is the best way, but I bet it's faster when there are enough groups involved. – Frank May 07 '17 at 04:40

2 Answers2

2

It isn't the most elegant solution, but here's one way you could do this with dplyr.


library(tibble)
library(dplyr)

input <- tibble::tribble(
  ~ID,         ~Name, ~Total_Groceries, ~Apple, ~Banana, ~Pineapple,
  1L,      "Boston",              35L,     5L,     10L,        20L,
  2L,      "Boston",              70L,    10L,     20L,        40L,
  3L,      "Boston",              95L,    15L,     30L,        50L,
  4L,      "Boston",             120L,    20L,     40L,        60L,
  5L,     "Seattle",              10L,     2L,      5L,         3L,
  6L,  "Washington",              20L,    10L,      8L,         2L
)

input %>% 
  group_by(Name) %>% 
  mutate_at(vars(ID), funs(paste(., collapse = ";"))) %>% 
  group_by(Name, ID) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE)
#> Source: local data frame [3 x 6]
#> Groups: Name [?]
#> 
#>         Name      ID Total_Groceries Apple Banana Pineapple
#>        <chr>   <chr>           <int> <int>  <int>     <int>
#> 1     Boston 1;2;3;4             320    50    100       170
#> 2    Seattle       5              10     2      5         3
#> 3 Washington       6              20    10      8         2
austensen
  • 2,857
  • 13
  • 24
  • I suspect `input %>% group_by(Name, ID = paste(as.character(ID), collapse = ";")) %>% summarise_if(is.numeric, sum, na.rm = TRUE)` works; can't test since apparently I don't have summarise_if. If that doesn't work, maybe `group_by(Name) %>% group_by(ID = ..., add = TRUE) %>% summarise_if(...)`. – Frank May 07 '17 at 03:44
  • 2
    @Frank, the first one will paste full column together, therefore doesn't work. The second one works. – mt1022 May 07 '17 at 04:24
1

Another data.table method that follows Frank's advise of splitting up the operations to allow for Gforce optimization of sum is

df[, paste(ID, collapse=";"), by=Name
   ][df[, lapply(.SD, sum), .SDcols=Total_Groceries:Pineapple, by=Name],
   on="Name"]
         Name      V1 Total_Groceries Apple Banana Pineapple
1:     Boston 1;2;3;4             320    50    100       170
2:    Seattle       5              10     2      5         3
3: Washington       6              20    10      8         2

The idea is to first collapse ID, then merge on a data.table that calculates the sum of the remaining columns by Name. The selection of columns to sum are explicitly listed in the .SDcols argument. Note that if you add verbose=TRUE in the i argument of the second chain, you will sees the GForce is activated in this calculation.

lmo
  • 37,904
  • 9
  • 56
  • 69