0

Summary

Provided a dataframe in which I have several columns that are variables (each of them being numeric but one, which is a factor) and rows are observations,I would like to create a new column with the mean of all numeric columns + another one with a weighted mean of all numeric columns.

I have found quite some ways that apparently solve this problem (using dplyr, lapply, data.table... ) but none of them work with wide dataframes (and I am not sure I can convert it to long format -see below, and please be patient before marking as duplicate, as I haven't found any answer to my problem).

Long version:

I have a dataframe in wide format like the one provided below (the original one has more than 1700 observations of 20 variables grouped into 30 neighbourhoods) that is the result of of calculating the median of values of each variable:

df = data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour = c("neighbour1", "neighbour2", "neighbour3")

df
> df
          X1         X2         X3         X4        X5  neighbour
1  1.0384405  0.6116994 -0.2075835  0.3206011 1.3855455 neighbour1
2 -0.5115649 -0.7722500  0.8374265 -1.3697758 0.1690452 neighbour2
3  1.0145282  0.6809156 -0.2918737  0.2912297 1.0689213 neighbour3

I would like to create

  • 1) a column named mean that is the mean of all numeric values (all columns but neighbour) and
  • 2) a wmean column with is the weighted mean of each column, where the weight is provided by the following vector: weight = c(.25, .05, .3, .3, .3)

My first attempt was using dplyr::mutate to create those columns, but I haven't succeeded, most likely because I'm doing it wrong (So If I haven't succeeded with a regular mean, I have no clue of how to perform a weighted mean):

df = df %>%
  mutate(mean = mean(select(-neighbour)))
Error in mutate_impl(.data, dots) : 
  argumento no válido para un operador unitario
> df = df %>%
+   mutate(mean = mean())
Error in mutate_impl(.data, dots) : 
  el argumento "x" está ausente, sin valor por omisión
> df = df %>%
+   mutate(mean = mean(is.numeric()))
Error in mutate_impl(.data, dots) : 
  0 arguments passed to 'is.numeric' which requires 1
> 

Also tried with mutate_each, but I'm assuming that my problem is that I do not know how to pass the right columns to calculate the mean (not to mention that I have no clue about weighted mean).

From what I have read there are many ways to create the desired columns:

  • This answer by Carlos Cinelli gives examples using sapply + filter, dplyr and tydr, but all these solutions are based on the fact they do not create a new column, with the median of each neighbour's observations but the median of each variables' values.

  • This answer by @Roland suggests to use data.table, but in order to be able to use it, my dataframe should have a column with the weight (whereas I do not have it and I'm afraid I wouldn't know how to create a column like that, provided that I have more than 1700 observations)

  • This answer by @Bob uses apply to create a mean of several columns (that's close to what I'm loooking for!) but still no clue of how to A) exclude the neighbour column, as otherwise it will fail, and B) to calculate the weighted mean.

Can anyone bring me some light with it? I am so ofuscated right now trying to solve this that I can't see the answer.

EDIT: As per @boshek's answer I have tried to convert from wide to long format and then applying summarise_each, but haven't succeeded neither:

df = df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour, variable) %>%
  summarise_each(., funs=mean)
Community
  • 1
  • 1
ccamara
  • 1,141
  • 1
  • 12
  • 32
  • Generally best to use `set.seed` before making a random-data example so it's easier to verify that answers work. For the same reason, I'd use `sample(100, n)` instead of `rnorm(n)`, just for readability. – Frank Dec 12 '16 at 19:48
  • Made some changes below – boshek Dec 12 '16 at 19:53
  • Fyi, Roland's way doesn't require a weight column. Simply using your weight vector works fine... `melt(data.table(df))[, .(mean(value), weighted.mean(value, weight)), by=neighbour, verbose=TRUE]` .. then you just need to merge / assign those values back. – Frank Dec 12 '16 at 19:56
  • Thanks for your suggestion, Frank. I've never used `set.seed` nor `sample()` I'll have to study how they work for future questions. – ccamara Dec 13 '16 at 09:51

4 Answers4

3
df$mean <- apply(df[1:5], 1, mean)
df$wt.mean <- apply(df[1:5], 1, weighted.mean, weight)
Jaap
  • 81,064
  • 34
  • 182
  • 193
joel.wilson
  • 8,243
  • 5
  • 28
  • 48
  • 1
    no need to creat an anonymous function, `apply(df[1:5], 1, weighted.mean, weight)` will work just as well – Jaap Dec 12 '16 at 19:31
2

Ok - so you want means ACROSS the row?

I'd use gather from dplyr then merge it back with your original data:

df.mean <- df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour) %>%
  summarise(mean_value=mean(value), wmean_value=weighted.mean(value))

df.comb <- df %>%
  full_join(.,df.mean, by=c("neighbour"))

There are a few ways to skin this cat but this is one.

Is this what you wanted?

boshek
  • 4,100
  • 1
  • 31
  • 55
  • I'm afraid it doesn't do what I want to achieve, as it doesn't create a new column. – ccamara Dec 12 '16 at 18:41
  • try using `mutate_each` instead of `summarise_each`, and then make sure you store the result into an object. – Benjamin Dec 12 '16 at 18:47
  • Made some edits. I think it should address your query. – boshek Dec 12 '16 at 19:37
  • Thanks but I'm afraid I haven't made myself understood, as AFAIK `mutate_each` makes calculations within the column's values, which is not what I want to achieve. Instead I want to make the mean of each neighbour's observations - this is I want to make comptutations per row basis. – ccamara Dec 12 '16 at 21:11
  • Revised the solution. Is this what you wanted? If not can you post an example of what you were after? – boshek Dec 12 '16 at 21:26
  • Exactly! except for the weighted mean, it works great! Thanks for your answer. – ccamara Dec 13 '16 at 09:54
  • What is not working for the weighted mean? The `apply` approach will work fine but I am curious why my answer wasn't accepted? – boshek Dec 13 '16 at 16:27
1

I think the rowMeans() function in base may be your best bet.

df$mean <- rowMeans(dplyr::select(df, starts_with("X")))

The weighted mean may be more difficult. I couldn't find a quick and clean way to do it, but here's an option that works:

# define a function that calculates a weighted mean
wmean <- function(x, weight){
  stopifnot(length(x) == length(weight))
  if(sum(weight) != 1) {
    message("Rescaling weights to sum to 1")
    weight <- weight/sum(weight)
  }
  wx <- sum(x * weight)
  return(wx)
}
# apply that function row by row to the X columns in df
df$wmean <- apply(X=dplyr::select(df, starts_with("X")), MARGIN=1, FUN=wmean, weight = weight)
Rose Hartman
  • 457
  • 4
  • 11
0

I know I'm a bit late posting this, but I was looking for a solution to a similar problem and found the rowWeightedMeans from the matrixStats library, wich also supports na.rm, you only need to convert to matrix, it works as follows:

library(matrixStats)
df$wmean <- rowWeightedMeans(as.matrix(df[ , c('X1', 'X2', 'X3', 'X4', 'X5')]), w = weight)

This worked perfectly for me and as mentioned, has the extra that supports na.rm = TRUE wich I needed