1

I frequently work with data frames and have to run some sophisticated data wrangling / manipulations by subgroup that is defined in one of the columns. I am aware of dplyr and group_by and know that many things could be solved using group_by. However, often I have to do some pretty intricate calculations and end up just using the 'for' loop.

I was wondering about the existence of some other general approach or paradigm that is faster/more elegant. Maybe map (that I am not very familiar with)?

Below is an example. Notice - it is fake and meaningless. So let's ignore why I need to do those things or the fact that there could be 2 consequtive NAs in a column, etc. That's not the focus of my question. The point is that often I have to operate "within the constraints of a subgroup" and then - inside that subgroup - I have to do operations columnwise, rowwise and sometimes even cellwise.

I also realize that I could probably put most of that code inside a function, split my data frame into a list based on 'group', apply this function to each element of that list and then do.call(rbind...) at the end. But is this the only way?

Thanks a lot for any hints!

library(dplyr)
library(forcats)
set.seed(123)
x <- tibble(group = c(rep('a', 10), rep('b', 10), rep('c', 10)),
                attrib = c(sample(c("one", "two", "three", "four"), 10, replace = T),
                           sample(c("one", "two", "three"), 10, replace = T),
                           sample(c("one", "three", "four"), 10, replace = T)),
                v1 = sample(c(1:5, NA), 30, replace = T),
                v2 = sample(c(1:5, NA), 30, replace = T),
                v3 = sample(c(1:5, NA), 30, replace = T),
                n1 = abs(rnorm(30)), n2 = abs(rnorm(30)), n3 = abs(rnorm(30)))

v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)

results <- NULL  # Placeholder for final results

for(i in seq(length(unique(x$group)))) { # loop through groups
  mygroup <- unique(x$group)[i]
  mysubtable <- x %>% filter(group == mygroup)

  # IMPUTE NAs in v columns
  # Replace every NA with a mean of values above and below it; and if it's the first or 
  # the last value, with the mean of 2 values below or above it.
  for (v in v_vars){  # loop through v columns
    which_nas <- which(is.na(mysubtable[[v]])) # create index of NAs for column v
    if (length(which_nas) == 0) next else {
      for (na in which_nas) { # loop through indexes of column values that are NAs
        if (na == 1) {
          mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na + 1], 
                                      mysubtable[[v]][na + 2]), na.rm = TRUE)
        } else if (na == nrow(mysubtable)) {
          mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 2],
                                      mysubtable[[v]][na - 1]), na.rm = TRUE)
        } else {
          mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 1], 
                                      mysubtable[[v]][na + 1]), na.rm = TRUE)
        }
      } # end of loop through NA indexes
    } # end of else
  } # end of loop through v vars

  # Aggregate v columns (mean) for each value of column 'attrib'
  result1 <- mysubtable %>% group_by(attrib) %>% 
    summarize_at(v_vars, mean)
  # Aggregate n columns (sum) for each value of column 'attrib'
  result2 <- mysubtable %>% group_by(attrib) %>% 
    summarize_at(n_vars, sum)
  # final result should contain the name of the group
  results[[i]] <- cbind(mygroup, result1, result2[-1])
}
results <- do.call(rbind, results)
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
user3245256
  • 1,842
  • 4
  • 24
  • 51

2 Answers2

2

Maybe this example is too simple, but in this case, the only thing you need to pull out is the imputation.

my_impute <- function(x) {
  which_nas <- which(is.na(x))
  for (na in which_nas) {
    if (na == 1) {
      x[na] <- mean(c(x[na + 1], x[na + 2]), na.rm = TRUE)
    } else if (na == length(x)) {
      x[na] <- mean(c(x[na - 2], x[na - 1]), na.rm = TRUE)
    } else {
      x[na] <- mean(c(x[na - 1], x[na + 1]), na.rm = TRUE)
    }
  }
  x
}

Then you just need to group appropriately and impute and summarize.

x2 <- x %>% group_by(group) %>% mutate_at(v_vars, my_impute) %>%
  group_by(group, attrib) 
full_join(x2 %>% summarize_at(v_vars, mean),
          x2 %>% summarize_at(n_vars, sum))

My usual method for things like this, where similar calculations need to be on a bunch of columns, is to put it in long format. Here it feels a little like the long way round, but perhaps this would be useful to see.

x %>% mutate(row=1:n()) %>% gather("variable", "value", c(v_vars, n_vars)) %>%
  separate(variable, c("var", "x"), sep=1) %>% spread(var, value) %>%
  arrange(group, x, row) %>% group_by(group, x) %>%
  mutate(v=my_impute(v)) %>% group_by(group, attrib, x) %>%
  summarize(v=mean(v), n=sum(n)) %>%
  gather("var", "value", v, n) %>% mutate(X=paste0(var, x)) %>%
  select(-x, -var) %>% spread(X, value)

More generally, split-apply-combine is probably the way to go, as you suggest in your question; here's a way using the tidyverse.

doX <- function(x) {
  x2 <- x %>% mutate_at(v_vars, my_impute) %>% group_by(attrib)
  full_join(x2 %>% summarize_at(v_vars, mean),
            x2 %>% summarize_at(n_vars, sum))
}
x %>% group_by(group) %>% nest() %>%
  mutate(result=map(data, doX)) %>% select(-data) %>% unnest()

The more traditional method is with do.call, split, and rbind; here I don't make the effort to keep the group information.

do.call(rbind, lapply(split(x, x$group), doX))
Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
1

The first thing to do is to change your data imputing into a function. I made some simple modifications to have it accept a vector and simplified the call to mean.

fx_na_rm <- function(z) {
  which_nas <- which(is.na(z))

  if (length(which_nas) > 0) {
    for (na in which_nas) { # loop through indexes of column values that are NAs
      if (na == 1) {
        z[na] <- mean(z[na + (1:2)], na.rm = TRUE)
      } else if (na == nrow(mysubtable)) {
        z[na] <- mean(z[na - (1:2)], na.rm = TRUE)
      } else {
        z[na] <- mean(z[c(na - 1, na + 1)], na.rm = TRUE)
      }
    } # end of loop through NA indexes
  }
  return(z)
}

I like data.table so here's a solution that uses it. Now since you use different functions for the n and v variable groups, most purrr or any other solutions will also be a little funny.

library(data.table)
dt <- copy(as.data.table(x))

v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)

dt[, (v_vars) := lapply(.SD, as.numeric), .SDcols = v_vars]
dt[, (v_vars) := lapply(.SD, fx_na_rm), by = group, .SDcols = v_vars]

# see https://stackoverflow.com/questions/50626316/r-data-table-apply-function-a-to-some-columns-and-function-b-to-some-others
scols <- list(v_vars, n_vars)
funs <- rep(c(mean, sum), lengths(scols))

dt[, setNames(Map(function(f, x) f(x), funs, .SD), unlist(scols))
   , by = .(group,attrib)
   , .SDcols = unlist(scols)]

The for loop itself is difficult to vectorize because the results can depend on itself. Here is my attempt which is not an identical output to yours:

# not identical
fx_na_rm2 <- function(z) {
  which_nas <- which(is.na(z))

  if (length(which_nas) > 0) {
    ind <- c(rbind(which_nas - 1 + 2 * (which_nas == 1) + -1 * (which_nas == length(z)),
                   which_nas + 1 + 1 * (which_nas == 1) + -2 * (which_nas == length(z)))) 

    z[which_nas] <- colMeans(matrix(z[ind], nrow = 2), na.rm = T)
  }
  return(z)
}
Cole
  • 11,130
  • 1
  • 9
  • 24