Calculate mean for each group, many groups, huge group_id, in R

Question

I have a big dataset in R, containing: items(item id) and review scores by clients. It is groupped by item code, each item appears one or more times in dataset. How can I mutate one column that has the mean of the review scores for each item?

looks like

   gt3dskh     4
   gt3dskh     3
   o1gsjnv     5
   abg56ji     4
   abg56ji     1
   ...

but, many more rows, item id unpredictable, do not know how many rows each group has.

Thank you in advance!

score 1 · Answer 1 · answered Jan 31 '22 at 01:05

Try this base R approach with aggregate

aggregate(score ~ id, df, mean)
       id score
1 abg56ji   2.5
2 gt3dskh   3.5
3 o1gsjnv   5.0

Data

df <- structure(list(id = c("gt3dskh", "gt3dskh", "o1gsjnv", "abg56ji",
"abg56ji"), score = c(4L, 3L, 5L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-5L))

score 1 · Answer 2 · answered Jan 31 '22 at 01:09

@shawn-hemelstrand has the right answer, but I'll just add that if you truly had a big dataset, you should use data.table not tidyverse

library(data.table)
setDT(df)
df[,.(mean_val = mean(score,na.rm=T)), by=id]

Output:

       id mean_val
1: gt3dskh      3.5
2: olgsjnv      5.0
3: abg56ji      2.5

score 0 · Answer 3 · answered Jan 31 '22 at 01:02

You can do this fairly easily with the tidyverse package:

Load tidyverse here:

library(tidyverse)

Then I'll just recreate your data frame here:

Dataframe

id <- c("gt3dskh",
        "gt3dskh",
        "olgsjnv",
        "abg56ji",
        "abg56ji")
score <- c(4,3,5,4,1)
df <- data.frame(id,score)

Finally, you just need to use the this code to group each id and then summarize the mean for each.

Summarize scores:

df %>% 
  group_by(id) %>% 
  summarize(mean = mean(score))

This should produce this output:

  id       mean
  <chr>   <dbl>
1 abg56ji   2.5
2 gt3dskh   3.5
3 olgsjnv   5

Observing a large dataset

Also just to piggyback off your issue of dealing with large datasets...there are multiple ways to get a cursory look at the data.

Here is one:

glimpse(df)

Which gives you this output showing the rows, columns, variables, and values attributed:

Rows: 5
Columns: 2
$ id    <chr> "gt3dskh", "gt3dskh", "olgsjnv", "abg56ji", "abg56ji"
$ score <dbl> 4, 3, 5, 4, 1

There is also a nice function from the skimr library that can be helpful for summarizing a lot of data.

library(skimr)
skim(df)

Which gives you a lot of valuable output as well if you want it:

-- Data Summary ------------------------
                           Values
Name                       df    
Number of rows             5     
Number of columns          2     
_______________________          
Column type frequency:           
  character                1     
  numeric                  1     
________________________         
Group variables            None  

-- Variable type: character ---------------------------------------------
# A tibble: 1 x 8
  skim_variable n_missing complete_rate   min   max empty n_unique
* <chr>             <int>         <dbl> <int> <int> <int>    <int>
1 id                    0             1     7     7     0        3
  whitespace
*      <int>
1          0

-- Variable type: numeric -----------------------------------------------
# A tibble: 1 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 score                 0             1   3.4  1.52     1     3     4
    p75  p100 hist 
* <dbl> <dbl> <chr>
1     4     5 ▃▁▃▇▃