Iterate through columns in dplyr

Question

I am trying to iterate a function progressively through the columns of a database. What are the options available for this? Am I limited to a for loop or is there a dplyr approach or similarly intuitive code structure I can use?

Essentially I have a matrix much much larger than the one constructed below, but of the same general structure. The first column explains which version to select as the refined datacollection, and then all the data that comprises each of these three versions.

library(dplyr)

# Function: creates a matrix of random strings, v = versionNumber
matADv.maker <- function (v){ 
    matADv <- data.frame(matrix(sample(letters[1:26], 10), nrow = 5))
    colnames(matADv) <- paste0("v", v, "_", letters[24:25])
    return(matADv)
}

set.seed(1)
lvl <- data.frame(c(as.integer(runif(5, 5, 8))))
colnames(lvl) <- "Level"
matADv5 <- matADv.maker(5)
matADv6 <- matADv.maker(6)
matADv7 <- matADv.maker(7)
matComp <- bind_cols(lvl, matADv5, matADv6, matADv7)
matComp
Source: local data frame [5 x 8]

  rowname Level  v5_x  v5_y  v6_x  v6_y  v7_x  v7_y
    (chr) (int) (chr) (chr) (chr) (chr) (chr) (chr)
1       1     5     x     e     m     t     k     z
2       2     6     z     d     r     e     a     l
3       3     6     p     n     x     z     j     x
4       4     7     o     g     i     c     u     d
5       5     5     b     s     y     u     h     o

I was wondering if there isn't instead some simple way to shuttle a function along the dataframe.

Essentially, How might I be able to iterate through columns by using dplyr or similar simpler constructions than for loops?

`rowwise()` http://stackoverflow.com/questions/21818181/applying-a-function-to-every-row-of-a-table-using-dplyr — Hack-R, Jun 16 '16 at 01:48
I don't want to summarise all the data on a row, I need to mutate one column completely(selecting the correct data for each row in that column), then move to the next one to the right - mutate the answer correctly into all the rows in that one - and then go to the next one to the right, and so on. — leerssej, Jun 16 '16 at 01:54
Someone tacked a sign at the top directing me to *row* iterating algorithms. What I am beginning to surmise is that the approach is never in a one column at a time format, in rowwise procession before progressing on to the next column; instead the data is always processed on a row at a time:proceeding column to column in that row before moving on. What I am principally interested in is finding a way to extend dplyr's utility from mutating only single columns in a dataframe with direct references to get my code to be able to modify multicolumn portions across a dataframe. — leerssej, Jun 16 '16 at 07:07
Woke up and realized Hadley already had prepared the way for a much better solution; avoid looping entirely and process all the appropriate columns in bulk. Still trying to work out if there aren't possibly going to be cases where I still want columnar iterations, but at least in this instance, bulk processing is better. — leerssej, Jun 16 '16 at 15:29
also realized that I had left the matrix maker function out the code so the example wasn't fully replicable. I've added in the missing parts above. — leerssej, Jun 16 '16 at 16:03

eipi10 · Answer 1 · 2016-06-16T17:12:23.310

You can avoid ifelse and looping by taking advantage of the regularity of the column names. Here's a base R solution:

For each row of matComp, you want to find the column whose column name contains the correct value of Level for a given "suffix" (x or y) and assign that value to a new refdat column. You can do this with the match function.

In the code below, we use apply to go through each row of matComp. In each row, we use match to find the index of the desired column so that we can return the correct value for refdat. sapply iterates over the two suffixes, and cbind adds the two new columns to matcomp.

matComp = cbind(matComp, 
                sapply(paste0("refdat_",c("x","y")), function(var) {
                  suffix = substr(var, nchar(var), nchar(var))
                  apply(matComp, 1, function(vec) {
                    vec[match(paste0("v", vec["Level"], "_", suffix), names(vec))]
                  })
                }), stringsAsFactors=FALSE)

  Level v5_x v5_y v6_x v6_y v7_x v7_y refdat_x refdat_y
1     5    x    e    m    t    k    z        x        e
2     6    z    d    r    e    a    l        r        e
3     6    p    n    x    z    j    x        x        z
4     7    o    g    i    c    u    d        u        d
5     5    b    s    y    u    h    o        b        s

Excellent. Very clever approach! Thank you for teaching me how well `match` can function as a logic switch alternative. This looks like it will come in handy in many situations: looking forward to using it. — leerssej, Jun 16 '16 at 16:09

leerssej · Accepted Answer · 2016-11-07T09:13:58.243

Well, I now realize Hadley had already prepared the way for the best solution of all; I need to process the column transformations in bulk. I collect all the columns with the same issue (filter), resolve the issues, and then join all the frames back together in a Split, Apply, Combine approach.

More efficient too: tackling the problem as 3 tables instead of hundreds of columnar iterations testing each data point and then reacting consecutively.

library(magrittr)

matComp %<>% add_rownames
v5Mat <- matComp %>% filter(Level == 5) %>% select(rowname, starts_with("v5"))
v6Mat <- matComp %>% filter(Level == 6) %>% select(rowname, starts_with("v6"))
v7Mat <- matComp %>% filter(Level == 7) %>% select(rowname, starts_with("v7"))
colnames(v5Mat) %<>%  gsub("v\\d_", "ref", .)
colnames(v6Mat) %<>%  gsub("v\\d_", "ref", .)
colnames(v7Mat) %<>%  gsub("v\\d_", "ref", .)
refinedMat <- 
    Reduce(function(...) merge(..., all=TRUE), list(matComp, v5Mat, v6Mat, v7Mat)) %>% 
    group_by(rowname) %>% 
    summarise_each(funs(na.omit))
refinedMat

results in:

 Source: local data frame [5 x 10]

   rowname  refx  refy Level  v5_x  v5_y  v6_x  v6_y  v7_x  v7_y
     (chr) (chr) (chr) (int) (chr) (chr) (chr) (chr) (chr) (chr)
 1       1     x     e     5     x     e     m     t     k     z
 2       2     r     e     6     z     d     r     e     a     l
 3       3     x     z     6     p     n     x     z     j     x
 4       4     u     d     7     o     g     i     c     u     d
 5       5     b     s     5     b     s     y     u     h     o

Looking at hadley's article on tidy data, I see he identifies the pitfall "Column headers are values, not variable names." which you seem to have here. Your col name suffixes are actual data, not var names. Probably your digits following `v` are also essentially data, not var names. https://www.jstatsoft.org/article/view/v059i10 (I haven't looked at your question in any detail, just thinking about your first paragraph here.) — Frank, Jun 16 '16 at 19:33
@Frank - thanks for finding that in tidyr. :-) It certainly seems true; however this seems to work strongly in our favor. If we pull the names up in the colnames function we get to treat them just like any other vector. @ eipi10 extends this transformation even farther to use the actual column headers themselves as direct signals for the logic switches. — leerssej, Jun 16 '16 at 20:58

Iterate through columns in dplyr

2 Answers2