I want to add the values of a column when the corresponding values of other columns are duplicates R

Question

I have a similar question to this: (Sum the duplicate rows of particular columns in dataframe) but the solution does not work for me or I don't know how to modify it.

I need the add the Number column values together if both Reference and NODCCODE match, even if NODCCODE are not adjacent within Reference number.

I have this:

structure(list(Reference = c("BBM101", "BBM102", 
                             "BBM102", "BBM102", "BBM103", "BBM103", 
                             "BBM104", "BBM105", "BBM105", "BBM105"), 
               NODCCODE = c("101","301", "201", "201", "201", "401", "401", "201", "102", "201"), 
               Number = c(2, 1, 3, 1, 3, 14, 3, 24, 2, 1)), 
          row.names = c(NA, 10L), class = "data.frame")

   Reference NODCCODE Number
1     BBM101      101      2
2     BBM102      301      1
3     BBM102      201      3
4     BBM102      201      1
5     BBM103      201      3
6     BBM103      401     14
7     BBM104      401      3
8     BBM105      201     24
9     BBM105      102      2
10    BBM105      201      1

I want this:

structure(list(Reference = c("BBM101", "BBM102", "BBM102", "BBM103", "BBM103", "BBM104", "BBM105", "BBM105"), 
               NODCCODE = c("101","301", "201", "201", "401", "401", "201", "102"), 
               Number = c(2, 1, 4, 3, 14, 3, 25, 2)), 
          row.names = c(NA, 8L), class = "data.frame")

Reference NODCCODE Number
1    BBM101      101      2
2    BBM102      301      1
3    BBM102      201      4
4    BBM103      201      3
5    BBM103      401     14
6    BBM104      401      3
7    BBM105      201     25
8    BBM105      102      2

notice that lines 3 and 4 Reference and NODCCODE were merged and Number column was added. Also lines 8 and 10 even though there was a 102 value in between the 201 values, all with the same Reference number, so they were added. I do not care whether the remaining rows are at the beginning or the end of that group of reference numbers.

`aggregate(Number~Reference+NODCCODE, df, sum)` – Ronak Shah Apr 25 '20 at 03:56 — Ronak Shah, Apr 25 '20 at 03:56

score 1 · Accepted Answer · answered Apr 25 '20 at 02:02

I believe simple as this with the tidyverse? The sum of Reference with only one matching NODCCODE will be the unique value, entries with the same reference and NODCCODE will be summed

library(tidyverse)

struct <- structure(list(Reference = c("BBM101", "BBM102", 
                             "BBM102", "BBM102", "BBM103", "BBM103", 
                             "BBM104", "BBM105", "BBM105", "BBM105"), 
               NODCCODE = c("101","301", "201", "201", "201", "401", "401", "201", "102", "201"), 
               Number = c(2, 1, 3, 1, 3, 14, 3, 24, 2, 1)), 
          row.names = c(NA, 10L), class = "data.frame")


result <- struct %>% 
  group_by(Reference,NODCCODE) %>% 
  summarise(Number = sum(Number)) %>% 
  arrange(Reference) %>% 
  ungroup()

result
#> # A tibble: 8 x 3
#>   Reference NODCCODE Number
#>   <chr>     <chr>     <dbl>
#> 1 BBM101    101           2
#> 2 BBM102    201           4
#> 3 BBM102    301           1
#> 4 BBM103    201           3
#> 5 BBM103    401          14
#> 6 BBM104    401           3
#> 7 BBM105    102           2
#> 8 BBM105    201          25

^{Created on 2020-04-24 by the reprex package (v0.3.0)}

score 1 · Answer 2 · answered Apr 25 '20 at 02:15

1

If you load the data.table package convert your data.frame to a data.table (using setDT) you can do this

unique(dt1[, Number := sum(Number), by = c("Reference", "NODCCODE")])

Here the dt[i, j, by] notation is used by reassigning the sum of number to number, with that function being done by unique combinations of reference and NODCCODE, and wrapping that in a unqiue() function to get the unqiue rows.

answered Apr 25 '20 at 02:15

rg255

4,119
3
22
40

Thanks, I'll try that. Is data.table generally better to use than data.frame? – Johnny5ish Apr 25 '20 at 02:42
1

Data.tables come from a package and are designed to build upon and improve upon data.frames, they are more efficient in terms of both computational work and syntax. Because they build on data.frames they are quite easy to incorporate without much work and over time you can unleash their full usefulness as you learn more about them – rg255 Apr 25 '20 at 02:55
Thank you, your answer worked great, but I went with the other since I have more columns and it doesn't matter for that code. – Johnny5ish Apr 25 '20 at 03:16
1

What do you.mean by more columns? To sum? – rg255 Apr 25 '20 at 05:25
There area additional columns in my data set that have nothing to do with my question. When I tried your code I had to remove them all first to get it to work. – Johnny5ish Apr 30 '20 at 15:40

I want to add the values of a column when the corresponding values of other columns are duplicates R

2 Answers2