2

I have been struggling for a while with what feels like should be a very simple operation, and have tried different methods but none of them appear to be fruitful.

I have a dataset that looks like this:

df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"), 
                 year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
                 station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
                 station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
                 station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
                 station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))

Now, what I need, is to create a new variable, let's call it new_station, that takes a value conditional on each name at every given year. For example:

  • For john I need the mean of station1 and station3.
  • For paul I need just station 4.
  • For ringo I need the mean of station1, station2, station3; and
  • For george I need just station4.

I have tried several combinations of filter, select and mutate, along the lines of:

df %>%
  filter(name == "john") %>%
  select(station1, station3) %>%
  mutate(new_station = rowMeans(c(station1, station3)))

But it won't let me assign the value to only the values of a single row. Some other attempts lead each row in the new column, to be the mean of all of the 6 cells (2 stations x 3 years) when I just need the mean for that specific year. Other methods I have tried, are not able to deal with the fact that there are some missing values and I need those omitted.

I need a sort of loop that is scalable, just changing the conditions for each name, since in real life I have something like a dataset of 21 names and 30 stations.

Any thoughts?

Note: In case it illustrates what I'm trying to do, I know how to do this in Stata. In Stata, for the name john, it would look something like:

egen new_station = rowmean(station1 station3) if name == "john"

I just need to do something like that in R.

Thank you!

alf10087
  • 25
  • 4

2 Answers2

3

I would convert the data to a long format, then use case_when. You can convert it back to wide when you're done, if you like.

df$id = 1:nrow(df)

library(tidyr) 
df %>% pivot_longer(
    cols = starts_with("station"), 
    names_to = "station", names_prefix = "station",
    values_to = "value"
  ) %>%
  group_by(name, year) %>%
  mutate(result = case_when(
    name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
    name %in% c("paul", "george") ~ value[station == 4],
    name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
  ))
# # A tibble: 48 x 6
# # Groups:   name, year [12]
#    name   year    id station value result
#    <fct> <dbl> <int> <chr>   <dbl>  <dbl>
#  1 john   2018     1 1           1   1   
#  2 john   2018     1 2          NA   1   
#  3 john   2018     1 3          NA   1   
#  4 john   2018     1 4           9   1   
#  5 paul   2018     2 1           2   8   
#  6 paul   2018     2 2           6   8   
#  7 paul   2018     2 3           2   8   
#  8 paul   2018     2 4           8   8   
#  9 ringo  2018     3 1           3   4.67
# 10 ringo  2018     3 2           8   4.67
# # ... with 38 more rows
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you so much, that seems to be doing the job! I'm not very knowledgeable in tidyr, but to revert back to wide I'm trying this, but can't quite get there: `df %>% pivot_wider(id_cols = id, names_prefix = "station", names_sep = "", values_from = "value")` – alf10087 Oct 12 '19 at 03:00
  • In this case, I think it's easier to specify the `names_from`: ` ... `%>% pivot_wider(names_from = "station", names_prefix = "station", values_from = "value")` – Gregor Thomas Oct 12 '19 at 17:16
0

Here's a solution. It relies on creating a lookup table and taking the rowMeans() of a subset of a subset of data. :

library(data.table)

dt <- as.data.table(DF)
dt[, name := as.character(name)]

lookup <- list(john = c('station1', 'station3'),
               paul = 'station4',
               ringo = c('station1','station2','station3'),
               george = 'station4')

dt[,
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
dt

Based on OP comments, it is safer to subset the dt on the names of the lookup table:

dt <- as.data.table(DF)
dt[, name := as.character(name)]

lookup[[4]] <- NULL
setdiff(dt[, name], names(lookup))

# error
dt[,
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
# OK
dt[name %in% names(lookup),
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]

dt

To better understand what is happening, I recommend running the following lines:

dt <- as.data.table(DF)
# what is .SD?
dt[, print(.SD), by = name]
dt[, .SD[,print(.SD) , .SDcols = lookup[[unlist(.BY)]]], by = name]

#what is .BY?
dt[, print(.BY), by = name]
dt[, print(unlist(.BY)), by = name]
dt[, name := as.character(name)]
dt[, print(unlist(.BY)), by = name]

References:

Great explanation of Subset of Data.table: What does .SD stand for in data.table in R

Cole
  • 11,130
  • 1
  • 9
  • 24
  • Thank you! It works on this data set, but when I try on my actual one (which has pretty much the same shape) I get this error: `Error in `[.data.table`(.SD, , .SD, .SDcols = lookup[[as.character(unlist(.BY))]]) : .SDcols should be column numbers or names` Any thoughts on what it could be? – alf10087 Oct 12 '19 at 05:42
  • @alf10087 see edit. Using ```dt[name %in% names(lookup), ...]``` is safer and should prevent the error. It sounds like there are currently names in the ```dt``` that are not in the lookup table. – Cole Oct 12 '19 at 13:57