0

I have a dataframe which looks as following:

head(test_df, n =15)
# print the first 15rows of the dataframe

               value frequency                        index
1  -2.90267705917358         1                            1
2  -2.90254878997803         1                            1
3  -2.90252590179443         1                            1
4  -2.90219354629517         1                            1
5  -2.90201354026794         1                            1
6   -2.9016375541687         1                            1
7  -2.90107154846191         1                            1
8  -2.90089440345764         1                            1
9  -2.89996957778931         1                            1
10 -2.89970088005066         1                            1
11 -2.89928865432739         1                            2
12 -2.89920520782471         1                            2
13 -2.89907360076904         1                            2
14 -2.89888191223145         1                            2
15  -2.8988630771637         1                            2

The dataframe has 3columns and 61819rows. To aggregate the dataframe, I want to get the mean value for the columns 'value' and 'frequency' for all rows with the same 'index'.

I already found some useful links, see:

https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/

Which is the simplest way to aggregate rows (sum) by columns values the following type of data frame on R?

However, I could not solve the problem yet.

test_df_ag <- stats::aggregate(test_df[1:2], by = test_df[3], FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)

   index value frequency
1      1    NA         1
2      2    NA         1
3      3    NA         1
4      4    NA         1
5      5    NA         1
6      6    NA         1
7      7    NA         1
8      8    NA         1
9      9    NA         1
10    10    NA         1
11    11    NA         1
12    12    NA         1
13    13    NA         1
14    14    NA         1
15    15    NA         1

Since I just get NA values for the column 'value', I wonder whether it might just be a data type issue?! However also when I tried to convert the data type I failed...

base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"
ExploreR
  • 313
  • 4
  • 15

3 Answers3

2

1. Here is a base R solution.

aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
#  index     value frequency
#1     1 -2.901523         1
#2     2 -2.899062         1

2. And a simple dplyr solution.

library(dplyr)

test_df %>%
  group_by(index) %>%
  summarize(across(1:2, mean))
## A tibble: 2 x 3
#  index value frequency
#* <int> <dbl>     <dbl>
#1     1 -2.90         1
#2     2 -2.90         1

Data

test_df <-
structure(list(value = c(-2.90267705917358, -2.90254878997803, 
-2.90252590179443, -2.90219354629517, -2.90201354026794, -2.9016375541687, 
-2.90107154846191, -2.90089440345764, -2.89996957778931, -2.89970088005066, 
-2.89928865432739, -2.89920520782471, -2.89907360076904, -2.89888191223145, 
-2.8988630771637), frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), index = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • thanks a lot for your helpful answer, I just tried your suggested base R solution! it runs through without throwing an error, however I do not get a meaningful result: for 'index' == 1 I receive a positive mean for 'value', which makes no sense since all input is negative. do you know what to do? – ExploreR May 18 '21 at 14:18
  • 1
    @ExploreR That may happen if `value` is of class factor. What does `class(value)` return? If I'm right, see [this SO post](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information). And [this one](https://stackoverflow.com/questions/27528907/how-to-convert-data-frame-column-from-factor-to-numeric/27528953). – Rui Barradas May 18 '21 at 16:50
2

Using data.table

library(data.table)
setDT(test_df)[, lapply(.SD, mean), by = index, .SDcols = 1:2]
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Try tidyverse. test_summary <- test_df %>% group_by(index) %>% summarise(n=n(), mean_value=mean(value, na.rm=T),mean_frequency=mean(frequency, na.rm=T)).

Oh, and, of course, you should make sure you've checked the quality of your data and understand the ifs and whys of any NA's in your data set.

PeterK
  • 1,185
  • 1
  • 9
  • 23