5

I have a data.frame with a large number of columns whose names follow a pattern. Such as:

df <- data.frame(
  x_1 = c(1, NA, 3), 
  x_2 = c(1, 2, 4), 
  y_1 = c(NA, 2, 1), 
  y_2 = c(5, 6, 7)
)

I would like to apply mutate_at to perform the same operation on each pair of columns. As in:

df %>%
  mutate(
    x = ifelse(is.na(x_1), x_2, x_1), 
    y = ifelse(is.na(y_1), y_2, y_1)
  )

Is there a way I can do that with mutate_at/mutate_each?

This:

df %>%
  mutate_each(vars(x_1, y_1), funs(ifelse(is.na(.), vars(x_2, y_2), .)))

and various variations I've tried all fail.

The question is similar to Using functions of multiple columns in a dplyr mutate_at call, but different in that the second argument to the function call is not a single column, but a different column for each column in vars.

Thanks in advance.

Bob
  • 1,274
  • 1
  • 13
  • 26
  • I'm working with a similar thing right now. It's the same issue as in my previous question: https://stackoverflow.com/questions/47005763/looping-across-multiple-variables-and-parameters-using-map-and-mutate, but in this case the dataset is so big that RStudio crashes. – LightonGlass Mar 08 '18 at 21:33
  • a `data.table` `set` loop is probably one of the faster ways to do this. `dplyr::coalesce` might be a bit better for readability as well – zacdav Jul 11 '18 at 13:00

3 Answers3

3

I don't know if you can get it that way, but here's a different perspective on the problem. If you find yourself with really wide data (e.g., tons of columns with similar names) and you want to do something with them, it might help to tidy the data (long in stata terms) with tidyr::gather (see docs here http://tidyr.tidyverse.org/).

> df %>% gather()
   key value
1  x_1     1
2  x_1    NA
3  x_1     3
4  x_2     1
5  x_2     2
6  x_2     4
7  y_1    NA
8  y_1     2
9  y_1     1
10 y_2     5
11 y_2     6
12 y_2     7

After converting the data to this format, it's easier to combine and rearrange values using group_by instead of trying to mutate_at things. E.g., you can ge the first values with df %>% gather() %>% mutate(var = substr(key,1,1)) and manipulate the xs and ys differently using group_by(var).

E_net4
  • 27,810
  • 13
  • 101
  • 139
twedl
  • 1,588
  • 1
  • 17
  • 28
  • I think that would have the opposite effect. After `gather`'ing, I'd then have the problem of grouping by both the prefix of the key, and the unique identifier of the original row. – Bob Jan 26 '18 at 23:53
  • To elaborate on the problem, my `data.frame` has data on some 117 different items, which are lab test results on samples where samples were tested in batches by different labs. There are therefore 7 columns for each item -- the value measured for the sample, the scale of the value sampled, and various data about the batch including batch variance and calibration information. So it takes some pretty substantial processing to get the values to be normalized and consistent. – Bob Jan 26 '18 at 23:57
  • Is it 7 columns total or 117 columns? The more columns you have (and more complex your column operations are), the harder it'll be to get the `mutate` family to do what you like. You might want to make it more tidy (see, e.g., https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) and use `group_by`, which is designed to handle that type of problem. Or you can switch to base R operations (which might handle the complex column operations more easily). – twedl Jan 27 '18 at 00:37
  • Its 819 columns. 7 columns each for 117 measured variables. The process of converting the raw measurements to useable measurements, using the other 6 columns per measured variable, are the same for each of the 117. This is why I'm looking for an approach based on the `mutate_` family of functions. So far what I've been able to come up with is to create 7 matrices of 117 columns each, but this is a rather unfortunate approach that really complicates the code. – Bob Jan 28 '18 at 08:49
1

Old question, but I agree with Jesse that you need to tidy your data a bit. gather would be the way to go, but it lacks somehow the possibility of stats::reshape where you can specify groups of columns to gather. So here's a solution with reshape:

df %>% 
   reshape(varying   = list(c("x_1", "y_1"), c("x_2", "y_2")), 
           times     = c("x", "y"),
           direction = "long") %>% 
   mutate(x = ifelse(is.na(x_1), x_2, x_1)) %>% 
   reshape(idvar     = "id", 
           timevar   = "time",
           direction = "wide") %>% 
   rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
#   id x_1 x_2 x y_1 y_2 y
# 1  1   1   1 1  NA   5 5
# 2  2  NA   2 2   2   6 2
# 3  3   3   4 3   1   7 1

In order to do that with any number of column pairs, you could do something like:

df2 <- setNames(cbind(df, df), c(t(outer(letters[23:26], 1:2, paste, sep = "_"))))
v <- split(names(df2), purrr::map_chr(names(df2), ~ gsub(".*_(.*)", "\\1", .)))
n <- unique(purrr::map_chr(names(df2), ~ gsub("_[0-9]+", "", .) ))
df2 %>% 
    reshape(varying   = v, 
            times     = n,
            direction = "long") %>% 
     mutate(x = ifelse(is.na(!!sym(v[[1]][1])), !!sym(v[[2]][1]), !!sym(v[[1]][1]))) %>% 
     reshape(idvar     = "id", 
             timevar   = "time",
             direction = "wide") %>% 
     rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
#   id w_1 w_2 w x_1 x_2 x y_1 y_2 y z_1 z_2 z
# 1  1   1   1 1  NA   5 5   1   1 1  NA   5 5
# 2  2  NA   2 2   2   6 2  NA   2 2   2   6 2
# 3  3   3   4 3   1   7 1   3   4 3   1   7 1

This assumes that columns which should be compared are next to each other and that all columns for with possible NA values are in columns suffixed by _1 and the replacement value columns are sufficed by _2.

thothal
  • 16,690
  • 3
  • 36
  • 71
  • I think this is right, and thanks. I agree that the data needs to be tidied -- this operation is early in a sequence of steps whose purpose is to tidy the data. – Bob Jul 11 '18 at 18:02
-3

When I asked this question, the answer was "you can't!" That's no longer the answer, since tidyr now supports pivot_wider and pivot_longer.

E_net4
  • 27,810
  • 13
  • 101
  • 139
Bob
  • 1,274
  • 1
  • 13
  • 26