How to calculate the mean of y for x=1

Question

I am trying to compute the mean of each cluster, where the clusters are assigned with cluster = sample(1:2,n,replace=T), using n=50, x = rnorm(n), and y = rnorm(n).

Then I created a data frame so that I could see x, y, and the randomly assigned clusters.

data = data.frame(x,y,cluster)

This gave me the following result:

           x          y    cluster
1  -0.89691455  0.41765075   2
2   0.18484918  0.98175278   1
3   1.58784533 -0.39269536   1
4  -1.13037567 -1.03966898   1
5  -0.08025176  1.78222896   2
6   0.13242028 -2.31106908   2
7   0.70795473  0.87860458   2
8  -0.23969802  0.03580672   1
9   1.98447394  1.01282869   2
10 -0.13878701  0.43226515   2

What I now want is the mean of each cluster. That is, what is the mean of cluster 1 and of cluster 2?

So what I did was:

m1 = sum(data[data$C==1])/sum(data$cluster==1)

This doesn't give me the value I wanted. I was expecting the mean of all x and y values combined, once for cluster 1 and once for cluster 2.

`aggregate(.~cluster, data, mean)` or `aggregate(cbind(x, y)~cluster, data, mean)` to be more specific. — Ronak Shah, Feb 06 '19 at 01:48
That gives me separate means for x and y by cluster. I was expecting a single mean for cluster=1, regardless of whether a value comes from x or y. — colbyjackson, Feb 06 '19 at 01:54
Is my answer what you were looking for, or have I misunderstood? If it doesn't make sense, I'll delete it. — Ronak Shah, Feb 06 '19 at 02:24
I would recommend creating a new column, `z = (x + y) / 2`, and then use whichever answer you like best from the [Calculate mean by group R-FAQ](https://stackoverflow.com/q/11562656/903061). — Gregor Thomas, Feb 06 '19 at 03:11

Ronak Shah · Accepted Answer · 2019-02-06T04:13:18.867

We could try using sapply by subsetting the dataframe on each unique cluster and then taking the mean of all the values in the dataframe.

with(data, sapply(sort(unique(cluster)), function(x) 
             mean(unlist(data[cluster == x, -3]))))

#[1] -0.1236613 -0.1849584
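As a sanity check on what these numbers represent, here is a minimal sketch that computes the cluster 1 mean by hand. It uses only the 10 rows printed in the question (an assumption for illustration, not the seeded 50-row data at the end of this answer), so its values differ from the output above. Compared with the attempt in the question, it adds the comma needed to index rows rather than columns, and it divides by the number of pooled values rather than the number of rows. The names `vals1` and `by_cluster` are made up for this example.

```r
# The 10 rows printed in the question (not the full n = 50 sample).
data <- data.frame(
  x = c(-0.89691455, 0.18484918, 1.58784533, -1.13037567, -0.08025176,
        0.13242028, 0.70795473, -0.23969802, 1.98447394, -0.13878701),
  y = c(0.41765075, 0.98175278, -0.39269536, -1.03966898, 1.78222896,
        -2.31106908, 0.87860458, 0.03580672, 1.01282869, 0.43226515),
  cluster = c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L)
)

# Pool every x and y value that belongs to cluster 1 into one numeric vector.
vals1 <- unlist(data[data$cluster == 1, c("x", "y")])

# Sum of the pooled values divided by how many there are
# (8 here: 4 rows times 2 columns).
m1 <- sum(vals1) / length(vals1)

# The sapply() approach from this answer, applied to the same 10 rows.
by_cluster <- sapply(sort(unique(data$cluster)), function(k)
  mean(unlist(data[data$cluster == k, c("x", "y")])))

m1             # approximately -0.00152
by_cluster[1]  # same value as m1
```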

Or similarly with split

sapply(split(data[1:2], data$cluster), function(x) mean(unlist(x)))

#         1          2 
#-0.1236613 -0.1849584

We could also do

with(data, tapply((x + y) / 2, cluster, mean))  #suggested by @Gregor

or

aggregate((x + y) / 2 ~ cluster, data, mean)

As mentioned by @Gregor in the comments, you could also create a new column holding (x + y)/2, which makes these calculations easier.
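A minimal sketch of that new-column idea, again assuming the 10 rows printed in the question rather than the seeded data. The column name `z` follows the comment; `res` and `pooled` are hypothetical names. Because every row contributes exactly one x and one y, the per-cluster mean of `z` should equal the mean of all x and y values pooled within that cluster, as long as there are no missing values.

```r
# The 10 rows printed in the question (not the full n = 50 sample).
data <- data.frame(
  x = c(-0.89691455, 0.18484918, 1.58784533, -1.13037567, -0.08025176,
        0.13242028, 0.70795473, -0.23969802, 1.98447394, -0.13878701),
  y = c(0.41765075, 0.98175278, -0.39269536, -1.03966898, 1.78222896,
        -2.31106908, 0.87860458, 0.03580672, 1.01282869, 0.43226515),
  cluster = c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L)
)

# Row-wise average of x and y, stored as a new column.
data$z <- (data$x + data$y) / 2

# Mean of z within each cluster; the result has columns `cluster` and `z`.
res <- aggregate(z ~ cluster, data = data, FUN = mean)

# Cross-check: stack x and y into one vector and average by cluster directly.
pooled <- tapply(c(data$x, data$y), rep(data$cluster, 2), mean)

res     # z is approximately -0.00152 for cluster 1 and 0.327 for cluster 2
pooled  # same two values
```

Keeping `z` in the data frame also means any of the usual mean-by-group idioms can be applied to a single column.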

data

set.seed(1234)
n = 50
data = data.frame(x = rnorm(n), y = rnorm(n), cluster = sample(1:2, n, replace = TRUE))

For one more `base` method, `with(data, tapply((x + y) / 2, cluster, mean))` — Gregor Thomas, Feb 06 '19 at 03:14

score 1 · Answer 2 · answered Feb 06 '19 at 02:26

Here's a tidyverse method. Convert to long format and group by cluster.

Solution

library(dplyr)  # %>%, group_by(), summarize()
library(tidyr)  # gather()

data %>% 
  gather(var, value, -cluster) %>% 
  group_by(cluster) %>% 
  summarize(mean = mean(value))

# A tibble: 2 x 2
  cluster     mean
    <int>    <dbl>
1       1 -0.00152
2       2  0.327
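In current tidyr, `gather()` is superseded by `pivot_longer()`. The following is a sketch of the same reshape-then-summarize idea written with `pivot_longer()`, assuming dplyr and tidyr (1.0.0 or later) are installed. The data frame is typed out from the rows in the question, and `res` and `check` are names made up for the example.

```r
library(dplyr)
library(tidyr)

# The 10 rows printed in the question.
data <- data.frame(
  x = c(-0.89691455, 0.18484918, 1.58784533, -1.13037567, -0.08025176,
        0.13242028, 0.70795473, -0.23969802, 1.98447394, -0.13878701),
  y = c(0.41765075, 0.98175278, -0.39269536, -1.03966898, 1.78222896,
        -2.31106908, 0.87860458, 0.03580672, 1.01282869, 0.43226515),
  cluster = c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L)
)

res <- data %>%
  # Stack x and y into one `value` column, keeping `cluster` alongside.
  pivot_longer(cols = -cluster, names_to = "var", values_to = "value") %>%
  group_by(cluster) %>%
  # One pooled mean per cluster.
  summarize(mean = mean(value))

# Base R cross-check using the row-wise average of x and y.
check <- with(data, tapply((x + y) / 2, cluster, mean))

res    # approximately -0.00152 for cluster 1 and 0.327 for cluster 2
check  # same two values
```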

Data

data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
x          y    cluster
-0.89691455  0.41765075   2
0.18484918  0.98175278   1
1.58784533 -0.39269536   1
-1.13037567 -1.03966898   1
-0.08025176  1.78222896   2
0.13242028 -2.31106908   2
0.70795473  0.87860458   2
-0.23969802  0.03580672   1
1.98447394  1.01282869   2
-0.13878701  0.43226515   2")
