0

I'm kind of new to this so I will try to make it as clear as possible.

I have a data frame composed of 5 vectors. The first one contains country names and the other 5 are values that came from a survey. Each row represents the answers given by someone to said survey.

Country     V1  V2  V3  V4  V5
Canada       1   2   4   1   2
Canada       1   3   2   2   3
Switzerland  3   1   1   1   1
Switzerland  1   4   3   4   2

I want to create a code that aggregate the results of my survey questions and then calculate the mean for every country in my data frame so I end up with only one value per country.

vAggregateCan <- (V1 + V2 + V3 + V4 + V5)/5

canValue <- mean(vAggregateCan)

I want to end up with a new data frame that contains only one entry of each country present in my initial data frame associated with the value resulting from the code I show as an example. That should look like this.

Country      Value
Canada       canValue
Switzerland  swissValue

Problem is, I don't know how to tell R to only aggregate the values for specific countries and I would like to find a way to do that operation as simply as possible without having to repeat the same operations over and over.

Like I said I'm kind of new with that and I'm not sure if my question is clear. Thanks for the help.

  • you can do this with the group_by() function in the package dplyr – gfgm Dec 05 '16 at 21:17
  • Shouldn't your average divide by `10` for the cases of `Canada` and `Switzerland` because there are two rows of 5 for each country? – aichao Dec 05 '16 at 21:18
  • Please check this: https://stackoverflow.com/questions/10945703/calculate-row-means-on-subset-of-columns, I think it answer your question! – Guz Dec 05 '16 at 21:19
  • Thank you for the link @guzman, would you like to add a `dplyr` or a better `base` solution to the link? – Pierre L Dec 05 '16 at 21:31
  • It is actually similar, but in your example you seem to have only one occurence of each of your IDs, how does that work occurences of the same ID? – RussLord Dec 05 '16 at 21:31
  • You actually have two different operations to do 1) the mean of each row and 2) the mean of each country – Pierre L Dec 05 '16 at 21:45
  • @Zheyuan Li the link does not have both operations covered. – Pierre L Dec 05 '16 at 22:04

2 Answers2

1

Due to the fact that you have multiple operations to carry out, the link is not an exact duplicate. We will 1) find the mean of each Country, and 2) find the mean of each row. The most straight-forward approach is using base functions.

df2 <- aggregate(. ~ Country, data=df1, mean)
data.frame(Country=df2$Country, Mean=rowMeans(df2[-1]))
#       Country Mean
# 1      Canada  2.1
# 2 Switzerland  2.1

The data.table and dplyr packages work, but they require extra hoops to jump through since they are built to work best column-wise. See this explanation for more on why row-wise operations are discouraged.

But if you need to use those packages:

library(data.table)
setDT(df1)[, .(Mean = rowMeans(.SD)), by=Country][, .(Mean=mean(Mean)), by=Country]
#        Country Mean
# 1:      Canada  2.1
# 2: Switzerland  2.1

library(dplyr)
df1 %>% group_by(Country) %>% 
  summarise_all(mean, na.rm=TRUE) %>%
  rowwise() %>% do(data.frame(Country=.$Country, Mean=rowMeans(.[,-1])))  
#       Country Mean
# 1      Canada  2.1
# 2 Switzerland  2.1
Community
  • 1
  • 1
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • With data.table `melt(setDT(DF))[, mean(value), by=Country]` and I guess similar not-jumping-through-hoops is possible with tidyr..? – Frank Dec 05 '16 at 23:01
-1

The rowMeans function can give a row-wise set of means and then tapply can aggregate within those country values:

tapply( rowMeans(df[-1]), df[['Country']], mean)
     Canada Switzerland 
        2.1         2.1 

If you need to index that object to extract a single value, then:

tapply( rowMeans(df[-1]), df[['Country']], mean)['Canada']
Canada 
   2.1 
IRTFM
  • 258,963
  • 21
  • 364
  • 487