0

I have 7 columns lets say A-G. In a data frame of 100s to 1000s of entries/observations. Column A-E & G have duplicates and Column F has new data. I want to aggregate or combine all these into on row summing Column F.

I have tried aggregate and group_by and results are not as desired.

A           B   C         D         E            F       G
IP1260341   1   6415504 26621   P1RRRRZ9999930S 412.25  LIS
IP1260224   1   6415504 26621   P1RRRRZ9999930S 208     LIS
IP1258742   1   6415504 25235   P1RRRRS0003300S 2775.58 LIS
IP1257207   1   6415504 25723   P1RRRRS0003400S 108     LIS
IP1259352   1   6415504 25235   P1RRRRS0003400S 150     LIS
IP1259623   1   6W98W09 25223   EMPL-05104NCTR  925     LIS
IP1260241   1   6415504 25235   P1RRRRS0003400S 360     LIS
IP1257207   1   6415504 25723   P1RRRRS0003400S 108     LIS
IP1259335   1   6415504 26335   P1RRRRP0016600S 9.64    LIS

The one below is an example of duplicate data that I need to aggregate column F

IP1259489   1   6415504 25435   P1RRRRS0003500S 196.8   LIS
IP1259489   1   6415504 25435   P1RRRRS0003500S 166.8   LIS
IP1259489   1   6415504 25435   P1RRRRS0003500S 572.4   LIS
IP1259489   1   6415504 25435   P1RRRRS0003500S 40      LIS

IP1260846   1   6999DFM 26509   14CDERS0078900S 23.75   UCH
IP1260846   1   6999DFM 26509   14CDERS0078900S 1980    UCH
IP1260967   1   6415544 26621   19PHCEE0769501S 283.36  UCH
IP1260746   1   6415515 26509   P1RRRRE0006900S 248     UCH
IP1260846   1   6999DFM 26509   14CDERS0078900S 699.65  UCH
IP1260849   1   6999DFM 25732   14CDERS0078900S 711.3   UCH
Edo
  • 7,567
  • 2
  • 9
  • 19
Kyle Overton
  • 49
  • 1
  • 7

1 Answers1

1

Yours looks like a classic use case that can be handled by using the dplyr library.

Have a look at ?summarise for more examples and detailed explanations.

dplyr helps you writing code in a sql-like manner.

With group_by you're telling your program to consider your dataframe as divided in groups. summarise will then simplify each group in one row. The column F in this case will be summed accordingly.

It's good practice to end a group_by statement tht contains multiple variables with ungroup. That's because when your data is summarise-ized the last group is automatically removed [G in this case]. However, the other groups [in this case A and E] stay logically attached to the result and it may unexpectedly affect future results.

library(dplyr)

df %>% 
    group_by(A, E, G) %>% 
    summarise(F = sum(F)) %>%
    ungroup()

Each dplyr verb requires as first argument a dataframe and returns a dataframe. That's why %>% works perfectly. %>% sets as first argument of the function on the left the result of the function on the right. So basically, the verbs are communicating consequently in a flawless fashion.

Hope it helps.

Edo
  • 7,567
  • 2
  • 9
  • 19
  • 1
    Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes. – Mark Rotteveel Aug 07 '20 at 06:52
  • You are most definitively right. I edited my answer to provide more informative details. Thanks for the suggestion. – Edo Aug 07 '20 at 07:57
  • Using: df %>% group_by(A, E, G) %>% summarise(F = sum(F)) %>% ungroup() I get 1 obs and 1 variable, not my desired outcome. I need to consolidate(aggregate) based on A but add any values in F together to corresponding A. This sums all values in F. – Kyle Overton Aug 10 '20 at 15:05
  • I don't understand what you need. Do you want to sort the A elements? – Edo Aug 10 '20 at 15:24
  • I want to sum the F values by the A values. Column A has multiple listings and in column F have new values. When I use the DF %>% group_by(A,E,G) %>% summarise(F = sum(F)) %>% the result is just 1 observation and 1 variable not 7 variables as in 7 columns minus the duplicate rows in A with summed values for those dups in F – Kyle Overton Aug 10 '20 at 15:33
  • so this? ` df %>% group_by(A, B, C, D, E, G) %>% summarise(F = sum(F)) %>% ungroup() ` – Edo Aug 10 '20 at 15:39