I have been struggling for a while with what feels like it should be a very simple operation. I have tried several methods, but none of them has worked.
I have a dataset that looks like this:
df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"),
                 year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
                 station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
                 station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
                 station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
                 station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))
What I need is to create a new variable, let's call it new_station, whose value in each row (one row per name and year) is computed from a different set of stations depending on the name. For example:
- For john I need the mean of station1 and station3.
- For paul I need just station4.
- For ringo I need the mean of station1, station2, station3; and
- For george I need just station4.
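To make the target concrete, the rules above can be written as a lookup from name to station columns. The following is only a base-R sketch of the result I am after, not necessarily the idiomatic or scalable way to get it. The name station_rules is made up for illustration, and the data frame is rebuilt compactly with rep() so the snippet runs on its own.

```r
# Same values as the example data in this question, built compactly.
df <- data.frame(
  name = rep(c("john", "paul", "ringo", "george"), times = 3),
  year = rep(c(2018, 2017, 2016), each = 4),
  station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
  station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
  station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
  station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA)
)

# Hypothetical lookup: which station columns apply to which name.
station_rules <- list(
  john   = c("station1", "station3"),
  paul   = "station4",
  ringo  = c("station1", "station2", "station3"),
  george = "station4"
)

# Start with all-missing, then fill in one name at a time.
df$new_station <- NA_real_
for (nm in names(station_rules)) {
  rows <- df$name == nm
  cols <- station_rules[[nm]]
  # drop = FALSE keeps a data frame even when only one column is selected;
  # na.rm = TRUE omits missing stations from each row's mean.
  df$new_station[rows] <- rowMeans(df[rows, cols, drop = FALSE], na.rm = TRUE)
}

print(df[, c("name", "year", "new_station")])
```

With this, john in 2017 gets 1.5 (the mean of 2 and 1), while john in 2018 gets 1 because station3 is missing for that row. A row where every selected station is missing, such as george in 2016, comes out as NaN rather than NA.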
I have tried several combinations of filter, select and mutate, along the lines of:
df %>%
  filter(name == "john") %>%
  select(station1, station3) %>%
  mutate(new_station = rowMeans(c(station1, station3)))
But it won't let me assign the value to only the rows for a single name. In some other attempts, every row of the new column ended up as the mean of all 6 cells (2 stations x 3 years), when I just need the mean for that specific year. Other methods I have tried cannot deal with the missing values, which I need omitted from the mean.
I need a sort of loop that is scalable, where I just change the conditions for each name, since my real dataset has about 21 names and 30 stations.
Any thoughts?
Note: In case it illustrates what I'm trying to do, I know how to do this in Stata. In Stata, for the name john, it would look something like:
egen new_station = rowmean(station1 station3) if name == "john"
I just need to do something like that in R.
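For reference, the closest base-R counterpart to that Stata line that I can sketch is a logical row index combined with rowMeans(). This is an approximation under my own assumptions, not a confirmed equivalent: the helper name is_john is made up, and only a few rows of the example data are used to keep it short.

```r
# A few rows taken from the example data (john in all three years, plus paul in 2018).
df <- data.frame(
  name = c("john", "paul", "john", "john"),
  year = c(2018, 2018, 2017, 2016),
  station1 = c(1, 2, 2, 7),
  station3 = c(NA, 2, 1, 3)
)

# Rows the condition applies to, like `if name == "john"` in Stata.
is_john <- df$name == "john"

# Everyone else stays missing; john's rows get the row mean,
# with NA stations omitted from the calculation.
df$new_station <- NA_real_
df$new_station[is_john] <- rowMeans(df[is_john, c("station1", "station3")], na.rm = TRUE)

print(df)
```

One difference I am aware of: when all selected stations are missing in a row, rowMeans() with na.rm = TRUE returns NaN, whereas Stata's rowmean leaves the result missing.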
Thank you!