I'm fairly new to R, and I know there are already a lot of answers to this in various places.

Although I welcome suggestions on how to achieve this, my question is more about why this operation is not simpler (or if it is simpler, I'd love to know how to do it because I've been searching for a while so please point me to the right post or resource).

I have a dataset, say it looks like this:

v1 <- runif(5, 1, 7)
v2 <- runif(5, 1, 7)
v3 <- runif(5, 1, 7)
v4 <- runif(5, 1, 7)
v5 <- runif(5, 1, 7)
df <- as.data.frame(cbind(v1, v2, v3, v4, v5))

Now imagine that instead of 5 variables I have a thousand.

I want to compute the mean of v2 through v4 (var2 to var4 in the SPSS example below) and store it in a new column, so that each row has its own mean value. I would call this "averaging across rows," but I realize there may be a different way to describe it.

For each row, I want the average to be computed from all available values in that row. If a person happens not to have answered a question (e.g., blank or NA), I still want that person to be included.

I don't want to have to count the columns in order to call them; I know the names of the variables. I also don't want to type several lines of code like they do in this post or in this post.

This is such a common operation in social sciences and I have a feeling it should be (or it is) simpler. If it is simpler, I'm not sure why I'm unable to find a simpler solution. In SPSS, for example, I would type something like:

COMPUTE newvar = mean(var2 to var4).
execute.

How do I do this in R?

My first intuition was to try something like this (which does not work):

df$newvar <- rowMeans(df, nat1:nat6)

I’ve been able to achieve my desired result with the following code:

library(dplyr)  # select() comes from dplyr
itemstouse <- select(df, v2:v4)  # columns v2 through v4 of the example data
df$newvar <- rowMeans(itemstouse)

Or I could include it in one line like this:

df$newvar <- rowMeans(select(df, v2:v4))

But that still requires three operations. It seems like it should be simpler and I'm confused as to why I'm unable to find a solution as simple as the SPSS script.

I admit, I am a noob when it comes to R, but some things should be fairly intuitive. ggplot is very intuitive, for example. And many things in R are quite easy to learn, but this one is tripping me up for some reason so I'd appreciate your input.

user1981275

2 Answers

If I have read your problem correctly, it is as follows: you have a data frame with 1000 columns, but you are interested only in var2 to var4. For each row, you want to compute the mean and store it in a new column. If that is right, the apply function does this. My code is below, assuming your larger dataset is called MyDF.

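# Keep columns 2 to 4; in R the row index is left empty to mean "all rows"
Subset_DF <- MyDF[, 2:4]
# MARGIN = 1 applies mean() to each row
NewCol <- apply(Subset_DF, MARGIN = 1, FUN = mean)
MyDF$NewCol <- NewCol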
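
If counting column positions is the sticking point, a name-based variant might look like the sketch below. It is only a sketch: the small MyDF built here is a hypothetical stand-in for the real dataset, the names v1 to v5 are taken from the question, and one value is set to NA purely to illustrate a skipped item. It relies on base R's subset(), whose select argument accepts a range of column names, and on the na.rm argument of rowMeans().

```r
# Hypothetical stand-in for the larger dataset; names v1..v5 follow the question.
set.seed(1)
MyDF <- data.frame(
  v1 = runif(5, 1, 7),
  v2 = runif(5, 1, 7),
  v3 = runif(5, 1, 7),
  v4 = runif(5, 1, 7),
  v5 = runif(5, 1, 7)
)
MyDF$v3[2] <- NA  # simulate one unanswered item

# select = v2:v4 picks every column from v2 to v4 by name, so no positions
# need to be counted; na.rm = TRUE averages whatever values each row has.
MyDF$NewCol <- rowMeans(subset(MyDF, select = v2:v4), na.rm = TRUE)
```

Like "var2 to var4" in SPSS, the range v2:v4 means every column located between those two names in the data frame, so it depends on column order but not on knowing the positions. A row whose selected values are all NA would come out as NaN.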

Please let me know if this is what you wanted.

Amit
  • Thanks for the reply, but this is not what I'm looking for. If I understand your code correctly, a line like ```MyDF[, 2:4]``` requires me to know the column numbers, which I don't want to have to find. I'm looking for a simple solution (as simple as the SPSS code) that requires two functions at most. – socialresearcher Jul 26 '19 at 20:33

There is a way to chain operations using dplyr that makes this kind of thing relatively easy to do. For example, something like the following should give you the end result you are looking for.

library(dplyr)

v1 <- runif(5, 1, 7)
v2 <- runif(5, 1, 7)
v3 <- runif(5, 1, 7)
v4 <- runif(5, 1, 7)
v5 <- runif(5, 1, 7)
df <- as.data.frame(cbind(v1, v2, v3, v4, v5))

df %>% mutate(mean_somecols = rowMeans(.[grep("v[2-4]", names(.))]))
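
One caveat when the real data has many more columns: the pattern "v[2-4]" matches any name that contains v2, v3 or v4, so it would also pick up names such as v20. A rough base R sketch of the difference, assuming a hypothetical one-row data frame called wide with columns v1 to v30 (not part of the original example):

```r
# Hypothetical wide data frame: one row, columns v1..v30 holding the values 1..30.
wide <- as.data.frame(
  matrix(1:30, nrow = 1, dimnames = list(NULL, paste0("v", 1:30)))
)

# Unanchored pattern: matches v2, v3, v4 and also v20..v29 and v30.
loose <- grep("v[2-4]", names(wide), value = TRUE)

# Anchored pattern: ^ and $ force the whole name to be exactly v2, v3 or v4.
exact <- grep("^v[2-4]$", names(wide), value = TRUE)

# Row mean over the exactly matched columns only.
wide$mean_somecols <- rowMeans(wide[exact])
```

The anchored pattern can be dropped into the mutate() call above in place of "v[2-4]" if the column names in the real data overlap in this way.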
Nikhil Gupta
  • Thanks but this is still more complicated than the solution in my original post. It also looks like it requires me to know the column numbers if I understand this piece correctly ```"v[2-4]"``` – socialresearcher Jul 26 '19 at 20:38
  • No, it is not referring to the column number (order does not matter in this solution). It only looks for the column names v2, v3 and v4 (which you would need to know anyway), so I don't think it requires anything additional. names(.) gets the column names, and grep finds all column names that contain v2, v3 or v4 (hence order does not matter). You can make the search as fancy as you want with more or less complicated regular expressions. – Nikhil Gupta Jul 27 '19 at 20:10