Select column name based on data frame content R

Question

I want to build a matrix or data frame by selecting, for each row, the names of the columns whose element is not NA. For example, suppose I have:

zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

which gives:

   a  b  c
1  1 NA  5
2 NA  5  6
3  3  4 NA
4  5 NA  8

I want to recognize each NA and build a new matrix or df that looks like:

a  c
b  c
a  b
a  c

There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!

Yes. Good question. Meant to put that in the question. Yes, there will be N columns in the final matrix, N = 2 in example, and the number of NAs in each row is the same. – Ernie Sep 20 '16 at 20:52

score 3 · Accepted Answer · answered Sep 20 '16 at 20:58

library(dplyr)
library(tidyr)

zz %>%
  mutate(k = row_number()) %>%
  gather(column, value, a, b, c) %>%
  filter(!is.na(value)) %>%
  group_by(k) %>%
  summarise(temp_var = paste(column, collapse = " ")) %>%
  separate(temp_var, into = c("var1", "var2"))

# A tibble: 4 × 3
      k  var1  var2
* <int> <chr> <chr>
1     1     a     c
2     2     b     c
3     3     a     b
4     4     a     c

answered Sep 20 '16 at 20:58

davechilders

This certainly works but takes me into the tool boxes of tidyr and dplyr, which I am not yet totally familiar with. Thanks. – Ernie Sep 20 '16 at 21:07

score 3 · Answer 2 · answered Sep 20 '16 at 21:14

Here's a possible vectorized base R approach:

indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
#    [,1] [,2]
#[1,] "a"  "c" 
#[2,] "b"  "c" 
#[3,] "a"  "b" 
#[4,] "a"  "c"

This finds the indices of the non-NA cells, sorts them by row, and then subsets the names of your zz data set according to the sorted column indices. You can wrap the result in as.data.frame if you prefer a data frame over a matrix.
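The three steps described above can be sketched one at a time as follows. This is only an illustrative breakdown of the same idea, not a different method; `n_keep` and `res` are hypothetical names introduced here, and the sketch assumes (as the question states) that every row has the same number of non-NA cells.

```r
# Same example data as in the question
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# Step 1: (row, col) positions of every non-NA cell, in column-major order
indx <- which(!is.na(zz), arr.ind = TRUE)

# Step 2: group the positions by row; order() leaves ties in their original
# order, so columns stay in left-to-right order within each row
indx <- indx[order(indx[, "row"]), ]

# Step 3: look up the column names and fold them into one row per input row.
# n_keep (hypothetical name) is the number of non-NA cells per row, assumed
# to be identical for all rows
n_keep <- nrow(indx) / nrow(zz)
res <- matrix(names(zz)[indx[, "col"]], ncol = n_keep, byrow = TRUE)
res
# rows of res: a c / b c / a b / a c
```

Computing `n_keep` instead of hard-coding `ncol = 2` is a small design choice that lets the same sketch handle inputs with a different, but still constant, number of non-NA cells per row.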

answered Sep 20 '16 at 21:14

David Arenburg

David, very nice. It's a compact solution that avoids dplyr and tidyr, both useful but require some study to use proficiently. Thanks. – Ernie Sep 20 '16 at 21:25

score 1 · Answer 3 · edited May 23 '17 at 10:34

EDIT: The data frame is now transposed once before processing, so there is no need to transpose twice inside the loop as in the first version.

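# Replace every non-NA value with its column name (note: this overwrites zz)
cols <- names(zz)
for (column in cols) {
  zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
# Transpose once so that each original row becomes a column
t_zz <- t(zz)
# Preallocate one list slot per original row, then drop the NAs from each
cols <- vector("list", length = ncol(t_zz))
for (i in seq_len(ncol(t_zz))) {
  cols[[i]] <- na.omit(t_zz[, i])
}
# Bind the per-row vectors as columns, then transpose back to one row per input row
new_dt <- as.data.frame(t(do.call("cbind", cols)))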

The tricky part here is that your goal actually changes the data frame's structure, so "removing the NAs in each row" means building a new data frame row by row, since each column of a result row could come from a different column of the original data frame.

zz[1, ] is a one-row data frame; use t to convert it into a vector so that na.omit can be applied, then transpose back to a row.

I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized within each column. The second one needs to go row by row anyway.
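For comparison, the same row-by-row idea can be sketched without explicit loops by using apply() over the rows of the logical NA mask. This is an alternative illustration, not the code from this answer; `keep` and `res` are hypothetical names, and the sketch assumes every row has the same number of non-NA cells (otherwise apply() would return a list rather than a matrix).

```r
# Same example data as in the question
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# For each row of the logical mask, keep the names of the non-NA columns.
# With equal-length results, apply() returns one column per input row.
keep <- apply(!is.na(zz), 1, function(ok) names(zz)[ok])

# Transpose so that each input row becomes one row of the result
res <- t(keep)
res
```

apply() still visits the rows one at a time internally, so this is mainly a more compact way to write the loop rather than a true vectorization.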

EDIT: Growing objects performs very badly in R. I knew I could use rbindlist from data.table, which takes a list of data frames, but the OP doesn't want new packages. My first attempt used plain rbind, which cannot take a list as input. Later I found that do.call is an alternative. It's still slower than rbindlist, though.
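A minimal base R sketch of the pattern discussed here: preallocate a list of fixed size, fill it row by row, and bind everything once at the end with do.call(rbind, ...) instead of growing an object inside the loop. The names `rows` and `n` are hypothetical, and the sketch works on a fresh copy of the example data.

```r
# Same example data as in the question
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

n <- nrow(zz)
rows <- vector("list", n)  # preallocated: one slot per input row

for (i in seq_len(n)) {
  # unlist() turns the one-row data frame into a plain named vector
  rows[[i]] <- names(zz)[!is.na(unlist(zz[i, ]))]
}

# Bind all per-row vectors in a single call, then convert to a data frame
new_dt <- as.data.frame(do.call(rbind, rows))
new_dt
```

do.call(rbind, rows) passes every list element as a separate argument to rbind, which is why it accepts a list where a direct rbind(rows) call does not give the intended result.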

edited May 23 '17 at 10:34

Community

answered Sep 20 '16 at 21:08

dracodoc

This is a very bad approach. You are running two for loops, one of them by row, while growing objects in the meantime. This goes against all basic programming rules in R – David Arenburg Sep 20 '16 at 21:15
Not sure I follow the answer, new_dt, which for the example is a 3X2 df with column values 1,2,3 and 3,2,6. It's not clear how that gives me the desired answer, which is a 4X2 matrix or df. – Ernie Sep 20 '16 at 21:17
Growing objects is bad; I can create a fixed-size list first and then rbindlist them together. However, that will need `data.table` and the OP doesn't want extra packages. I'm not sure if there is another method to bind data frame rows without growing an object. – dracodoc Sep 20 '16 at 21:18
@Ernie, new_dt is 4x2 data.frame for me. Did you run the code? – dracodoc Sep 20 '16 at 21:19
This won't need `data.table`. This is what `do.call(rbind, ...)` is for – David Arenburg Sep 20 '16 at 21:19
Sorry, I ran your code against a zz that had changed in my example on my machine. Correct example gives the right answer. Thanks. – Ernie Sep 20 '16 at 21:21
@DavidArenburg, yes, I found that too. Code was edited to use `do.call`. – dracodoc Sep 20 '16 at 21:29
