6

I have an excel file with multiple rows and columns (13232 rows and 18 columns). The last column gives some value. What I want to do is - find the rows with same column details except the last one and sum their last column value. For example: if input is

+---------+---------+---------+---------+
| Column1 | Column2 | Column3 | Column4 |
+---------+---------+---------+---------+
| ABC     | DEF     | GHI     |       5 |
| XYZ     | PQR     | LMN     |       4 |
| ABC     | DEF     | GHI     |      11 |
| Test1   | Test2   | Test3   |      12 |
| XYZ     | PQR     | LMN     |      54 |
+---------+---------+---------+---------+

then output should be

+---------+---------+---------+---------+
| Column1 | Column2 | Column3 | Column4 |
+---------+---------+---------+---------+
| ABC     | DEF     | GHI     |      16 |
| XYZ     | PQR     | LMN     |      58 |
| Test1   | Test2   | Test3   |      12 |
+---------+---------+---------+---------+

How this can be acheived in R?

1 Answers1

6

You can use aggregate from base R

 aggregate(Column4~., df1, FUN=sum)
 #    Column1 Column2 Column3 Column4
 #1     ABC     DEF     GHI      16
 #2     XYZ     PQR     LMN      58
 #3   Test1   Test2   Test3      12

Or

 library(data.table)
 setDT(df1)[, list(Column4=sum(Column4)), by = c(names(df1)[1:3])]
 #     Column1 Column2 Column3 Column4
 #1:     ABC     DEF     GHI      16
 #2:     XYZ     PQR     LMN      58
 #3:   Test1   Test2   Test3      12

Or

 library(sqldf)
 sqldf('select Column1, Column2, Column3,
          sum(Column4) as Column4
          from df1 
          group by Column1, Column2, Column3')
 #   Column1 Column2 Column3 Column4
 #1     ABC     DEF     GHI      16
 #2   Test1   Test2   Test3      12
 #3     XYZ     PQR     LMN      58

Or

library(dplyr)
df1 %>% group_by(Column1, Column2, Column3) %>%
  summarize(Column4 = sum(Column4))
# Source: local data frame [3 x 4]
# Groups: Column1, Column2

#   Column1 Column2 Column3 Column4
# 1     ABC     DEF     GHI      16
# 2   Test1   Test2   Test3      12
# 3     XYZ     PQR     LMN      58

Reproducible data:

df1 <-
structure(list(Column1 = structure(c(1L, 3L, 1L, 2L, 3L), .Label = c("ABC", 
"Test1", "XYZ"), class = "factor"), Column2 = structure(c(1L, 
2L, 1L, 3L, 2L), .Label = c("DEF", "PQR", "Test2"), class = "factor"), 
    Column3 = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("GHI", 
    "LMN", "Test3"), class = "factor"), Column4 = c(5L, 4L, 11L, 
    12L, 54L)), .Names = c("Column1", "Column2", "Column3", "Column4"
), class = "data.frame", row.names = c(NA, -5L))
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I didn't know you could use aggregate so directly like that. Excellent. I thought you had to give it a more clear variable. – Brandon Bertelsen Jul 09 '15 at 18:13
  • @BrandonBertelsen We can use either the formula method or the one with the list as you showed. Thank you for the comments – akrun Jul 09 '15 at 18:15