Summarise data by two grouping variables

Question

I have a large csv recorded as below.

name             score          year
team_a           4              2005
team_b           6              2005
team_a           2              2005
team_c           7              2005
team_d           3              2005
team_d           4              2005

team_a           2              2006
team_b           4              2006
team_b           3              2006
team_c           4              2006
team_c           2              2006
team_d           1              2006
team_e           5              2006

I would like to perform a downsample as below (add total scores of every year and have the recorded instead)

name           total_score    year
team_a         6              2005
team_b         6              2005
team_c         7              2005
team_d         7              2005

team_a         2              2006
team_b         7              2006
team_c         6              2006
team_d         1              2006
team_e         5              2006

Thoughts on this?

Okay, what's going on [here](https://stackoverflow.com/q/60339263/9494044)? — Dunois, Feb 21 '20 at 16:56
Same as I commented on [a very similar post](https://stackoverflow.com/q/60339263/9494044) today (your classmate?), it's best if you can show what you've tried so far, and narrow it down to 1 language. Some quick research on summing a value by a group should get you lots of SO posts like the one I flagged above — camille, Feb 21 '20 at 16:59

ipj · Answer 1 · 2020-02-21T17:02:33.317

2

My answer is Python solution. First please refer to pandas manual, for example (https://pandas.pydata.org/docs/) then try the following:

import pandas as pd
df = pd.read_csv(path_to_file)
df = df.groupby(["name","year"]).sum().reset_index()

However if file is very large, chunking method may be useful, see: how to read only a chunk of csv file fast?

edited Feb 21 '20 at 17:02

answered Feb 21 '20 at 16:34

ipj

3,488
1
14
18

1

Doesn't change the code fundamentally, but they want the sum, not mean – camille Feb 21 '20 at 17:00
My mistake, already fixed the answer, thx :-) – ipj Feb 21 '20 at 17:03

Till · Answer 2 · 2020-02-21T16:43:55.203

In R and tidyverse terms this would be a summarization of the scores by the groups given by name and year.

First lets generate your example data.

team_score_year_df <- tibble::tribble(
   ~name,~score,~year,
  "team_a", 4, 2005,
  "team_b", 6, 2005,
  "team_a", 2, 2005,
  "team_c", 7, 2005,
  "team_d", 3, 2005,
  "team_d", 4, 2005,
  "team_a", 2, 2006,
  "team_b", 4, 2006,
  "team_b", 3, 2006,
  "team_c", 4, 2006,
  "team_c", 2, 2006,
  "team_d", 1, 2006,
  "team_e", 5, 2006
  )

Now we use dplyr::group_by() and dplyr::summarise() to achieve your desired result.

library(dplyr)

team_score_year_df %>% 
  group_by(name, year) %>% 
  summarise(total_score = sum(score)) %>% 
  select(name, total_score, year) # In case order of columns is important.
#> # A tibble: 9 x 3
#> # Groups:   name [5]
#>   name   total_score  year
#>   <chr>        <dbl> <dbl>
#> 1 team_a           6  2005
#> 2 team_a           2  2006
#> 3 team_b           6  2005
#> 4 team_b           7  2006
#> 5 team_c           7  2005
#> 6 team_c           6  2006
#> 7 team_d           7  2005
#> 8 team_d           1  2006
#> 9 team_e           5  2006

Edit: Base R solution

As G5W points out in his comment, stats::aggregate() can do this as well.

result_df <- aggregate(
  team_score_year_df$score,
  list(name = team_score_year_df$name,
       year = team_score_year_df$year),
  sum
)

names(result_df)[3] <- "total_score"

result_df[c("name", "total_score", "year")]
#>     name total_score year
#> 1 team_a           6 2005
#> 2 team_b           6 2005
#> 3 team_c           7 2005
#> 4 team_d           7 2005
#> 5 team_a           2 2006
#> 6 team_b           7 2006
#> 7 team_c           6 2006
#> 8 team_d           1 2006
#> 9 team_e           5 2006

score 0 · Accepted Answer · answered Feb 21 '20 at 16:59

0

Assuming your data is in csv format here is a base R solution:

scores = read.csv("scores.csv")
result = aggregate(data=scores, score ~ name+year, FUN=sum)
colnames(result)[3] = "total_score"

answered Feb 21 '20 at 16:59

Gregory

26
4

Summarise data by two grouping variables

3 Answers3