Finding the average for certain unique categories in a data frame

Question

If I have an R data frame that looks like this:

| Value | TestNum | RepNum |
|:-----:|:-------:|:------:|
| 104   |       1 |      1 |
| 101   |       1 |      2 |
| 101   |       1 |      3 |
| 100   |       2 |      1 |
| 100   |       2 |      2 |
| 100   |       2 |      3 |
| 90    |       3 |      1 |
| 90    |       3 |      2 |
| 90    |       3 |      3 |
| 91    |       4 |      1 |
| 94    |       4 |      2 |
| 94    |       4 |      3 |
| 105   |       5 |      1 |
| 105   |       5 |      2 |
| 108   |       5 |      3 |

Is there a way that I can modify this data frame to find the average for the 3 RepNum Values for each unique TestNum so that it looks like this:

| Mean | TestNum |
|:----:|:-------:|
| 102  |       1 |
| 100  |       2 |
| 90   |       3 |
| 93   |       4 |
| 106  |       5 |

You can create this example data frame in R by copying and pasting this code and executing it.

Value<-c(100,101,100,100,100,100,90,90,90,93,94,94,105,105,108)
TestNum<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
RepNum<-c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)

df<-data.frame(Value,TestNum,RepNum)

EDIT: Here's a "more complete" example of the data frame I'm starting with an what I'm hoping to end up with:

| FileName | Version |  Category | Value | TestNum | RepNum |
|:--------:|:-------:|:---------:|:-----:|:-------:|:------:|
| File1    | 1.0.1   | Category1 |   104 |       1 |      1 |
| File1    | 1.0.1   | Category1 |   101 |       1 |      2 |
| File1    | 1.0.1   | Category1 |   101 |       1 |      3 |
| File1    | 1.0.2   | Category1 |   100 |       2 |      1 |
| File1    | 1.0.2   | Category1 |   100 |       2 |      2 |
| File1    | 1.0.2   | Category1 |   100 |       2 |      3 |
| File1    | 1.0.4   | Category1 |    90 |       3 |      1 |
| File1    | 1.0.4   | Category1 |    90 |       3 |      2 |
| File1    | 1.0.4   | Category1 |    90 |       3 |      3 |
| File1    | 1.0.5   | Category1 |    94 |       4 |      1 |
| File1    | 1.0.5   | Category1 |    91 |       4 |      2 |
| File1    | 1.0.5   | Category1 |    94 |       4 |      3 |
| File1    | 1.0.8   | Category1 |   105 |       5 |      1 |
| File1    | 1.0.8   | Category1 |   105 |       5 |      2 |
| File1    | 1.0.8   | Category1 |   108 |       5 |      3 |

And ending with this:

| FileName | Version |  Category | Mean_Value | TestNum |
|:--------:|:-------:|:---------:|:----------:|:-------:|
| File1    | 1.0.1   | Category1 |        102 |       1 |
| File1    | 1.0.2   | Category1 |        100 |       2 |
| File1    | 1.0.4   | Category1 |         90 |       3 |
| File1    | 1.0.5   | Category1 |         93 |       4 |
| File1    | 1.0.8   | Category1 |        106 |       5 |

As you've probably noticed, there is only 1 unique entry for both the FileName column and the Category column. The Version column is changing along with the TestNum column. So it seems like it might be easiest to simply add in these other columns after I've found the mean.

In the "full" code that I'm working on, I'm getting the mean values for several different files and many unique categories but I've been creating multiple data frames that are created by subsetting an original data frame on the FileName and Category (and an additional "Case" column).

Possible duplicate of [Mean per group in a data.frame](https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame) — Jealie, Jun 08 '17 at 16:51

d.b · Answer 1 · 2017-06-08T16:53:11.660

2

You can use aggregate

aggregate(x = df$Value, by = list(df$TestNum), FUN = mean)
#  Group.1         x
#1       1 100.33333
#2       2 100.00000
#3       3  90.00000
#4       4  93.66667
#5       5 106.00000

You could also split based on unique values of TestNum first and then summarize

data.frame(test_num = unique(df$TestNum), mean_value = sapply(split(df$Value, df$TestNum), mean))
#  test_num mean_value
#1        1  100.33333
#2        2  100.00000
#3        3   90.00000
#4        4   93.66667
#5        5  106.00000

edited Jun 08 '17 at 16:53

answered Jun 08 '17 at 16:47

d.b

32,245
6
36
77

Thanks! This works perfectly for the example I gave. One thing I noticed though is that if I extrapolate it to a data frame with more columns than the example data frame had, it leaves me with the Column in the "by" parameter and the mean of the x column. Is there a way that I can use the aggregate function to more or less "collapse" an existing data frame so that it retains other data (it's all repeating for each TestNum) and replaces the Value column with the new "Mean" column? (And drops the RepNum column of course) – Neal Jun 08 '17 at 16:55
I can add an edit with a table to show what I mean if that wasn't clear. (I felt like it wasn't clear lol) – Neal Jun 08 '17 at 16:56
sort of. However, the more I think about it, the more I think it will be easier to actually just "restructure" the resulting data frame because I don't think what I was thinking about makes logical sense to do in R. I'll throw an edit in the original example just in case there is an easy way to do it that someone out there knows how to do. – Neal Jun 08 '17 at 17:09
I got an error: `Error in FUN(X[[i]], ...) : 'x' must be an array of at least two dimensions` And `In split.default(df[,c(1,3)], df$TestNum) : data length is not a multiple of split variable` – Neal Jun 08 '17 at 17:25

amatsuo_net · Answer 2 · 2017-06-08T17:47:19.480

1

Also with data.table and dplyr, you can do

library(data.table)
setDT(df)[, mean(Value), by = TestNum]

library(dplyr)
df %>% group_by(TestNum) %>% summarise(mean(Value))

If there are other columns, you can use the other columns' first value in each TestNum. Like this:

df2<-data.frame(FileName = "File1", 
                Version = paste0("1.0.", rep(c(1,2,4,5,8), each = 3)),
                Value, TestNum, RepNum)


## data.table 
keep_cols <- c("FileName", "Version")
setDT(df2)[, c(lapply(.SD, function(x) x[1]), mean_Value = mean(Value)), 
           by = TestNum, .SDcols = keep_cols]

## dplyr
df2 %>% group_by(TestNum) %>% summarise(FileName = FileName[1], 
                                        Version = Version[1], 
                                        mean_Value = mean(Value))

edited Jun 08 '17 at 17:47

answered Jun 08 '17 at 16:52

amatsuo_net

2,409
11
20

You should take a look at the comment I left on d.b's answer. Because I think the `data.table` method of doing this is the cleanest and simplest way to do it but I had a caveat I put in the comments. – Neal Jun 08 '17 at 17:03
I'm not calculating additional columns. See my edit in the question for better detail. – Neal Jun 08 '17 at 17:30
When I tried this, I got back 98 for every entry in the `mean_Values` column. It looks like it took the average of all the values. (For the "data.table" method). It looks like your "dplyr" method works however. – Neal Jun 08 '17 at 17:59
Also, would it not be better to create the "Version" column using a vector created like `Versions<-unique(df$Version)`? This way it is easier than having to continually update that parameter. – Neal Jun 08 '17 at 18:03

Finding the average for certain unique categories in a data frame

2 Answers2