Summing up all repeated values in a dataset

Question

I have a dataset in which in a column I have the name of a person and in another column I have the amount she was paid for a given service. I'd like to build a list with the names of all people ordained by the total amount they were paid regardless of the service they performed. Example:

Ann     100
John    200
Matt    150
John    150
John    150
Ann     300
Erik    150

===========
John    500
Ann     400
Matt    150
Erik    150

I figured this involves looking for all repeated instances of the name of the person and then storing the value paid on the column, eventually summing up everything. The problem is I have too big of a list and can't check individual names. That is, I can't define a particular string for each name to be checked, rather I'd like the program to figure the repeated instances by itself and return the ordained list in the manner that I described. Is there any way to do this? I know a bit of Python and R so any method described in these languages would be particularly helpful.

For an R method, pick your favorite answer at the [How to sum a variable by group](https://stackoverflow.com/q/1660124/903061) FAQ. I would suggest `library(dplyr); your_data %>% group_by(name) %>% summarize(value = sum(value))`, filling in your actual column names where I have use `name` and `value`. — Gregor Thomas, Nov 30 '21 at 17:38

score 1 · Accepted Answer · answered Nov 30 '21 at 17:13

1

If you made your dataset into a pandas dataframe this is easily done with groupby


import pandas as pd
df = pd.DataFrame({'name':names, 'paid':paid})


total_pay = df.groupby(by='name').sum()

answered Nov 30 '21 at 17:13

jeed

136
5

Having a small issue, the sum seems to be concatenating the numbers instead of summing then. Is there any way to solve this? My numbers are in the format XXX.YYY,ZZ. – some1else2 Dec 01 '21 at 15:41
Don't worry, I figured it out. Just had to convert "." to "", "," to "." and use agg(lambda x: x.astype(float).sum()). – some1else2 Dec 01 '21 at 16:06

rg255 · Answer 2 · 2021-12-01T14:09:58.577

1. Using the mtcars data in R, a base R approach would be to use the tapply() function and isolate vectors of your name and value data (cyl and hp respectively here)

tapply(mtcars$hp, mtcars$cyl, sum)

2. I'd personally use a data.table approach - data.tables are fast and the syntax is (I think) intuitive and readable whilst being concise

library(data.table)
dtcars <- data.table(mtcars, keep.rownames=TRUE)
dtcars[, sum(hp), keyby=cyl]

The data.table is in essence a data.frame on steroids — it can do way more than the base R data.frame and far more efficiently. Variables can be referred to by name (e.g. cyl instead of mtcars$cyl) from within the square braces. The notation dt[i, j, by] describes the basic functionality — i is used to sort or subset data (dtcars[order(mpg)]), j is used to select or process variables (dtcars[, mean(mpg)]) and by (and keyby to sort the output) allow the j-operation to be done by grouping variables. Note you'll first need to convert the data.frame to a data.table using setDT() or data.table().

3. Or you could used a tidyverse approach, piping data from statement to statement.

library(tidyverse)
mtcars %>% 
  group_by(cyl) %>%
  summarize(sum(hp))

Rfanatic · Answer 3 · 2021-12-01T15:00:41.690

1

Or in data.table

library(data.table)

df<-data.frame("Name" = c("Ann", "John", "Matt", "John", "John", "Ann", "Erik"), "val"=c(100, 200,150, 150, 150, 300,150))

df <- data.table(df)
df[ , .(val= sum(val)), by = "Name" ]

   Name val
1:  Ann 400
2: John 500
3: Matt 150
4: Erik 150

edited Dec 01 '21 at 15:00

answered Dec 01 '21 at 08:47

Rfanatic

2,224
1
5
21

Clever, I was thinking of something like this in R, but didn't know how to pull it off. Thank you! – some1else2 Dec 01 '21 at 12:29

Summing up all repeated values in a dataset

3 Answers3