What is the best way to clean a data.frame in which row values are ordered properly, but are arbitrarily separated by NA columns?

Question

Occasionally, I need to clean very messy datasets, which are the result of importing a pdf table to a spreadsheet. When the pdf file is converted, all of the columns remain in the correct order (in relation to each other), but blank columns are scattered arbitrarily between them.

Here is a greatly simplified example.

data <- data.frame(
  W = sample(1:10),
  X = c("yes","no"," ","yes","no"," "," ","no","yes"," "),
  Y = c(" "," "," "," "," ","no","no"," "," ","yes"),
  Z = c(" "," ","no"," "," "," "," "," "," "," ")
)
data$X <- gsub(" ", NA, data$X)
data$Y <- gsub(" ", NA, data$Y)
data$Z <- gsub(" ", NA, data$Z)

This results in:

> data
    W    X    Y    Z
1   6  yes <NA> <NA>
2   4   no <NA> <NA>
3   3 <NA> <NA>   no
4   5  yes <NA> <NA>
5   9   no <NA> <NA>
6   1 <NA>   no <NA>
7   7 <NA>   no <NA>
8   8   no <NA> <NA>
9  10  yes <NA> <NA>
10  2 <NA>  yes <NA>

I want to get this:

    W    X
1   6   yes
2   4   no
3   3   no
4   5   yes
5   9   no
6   1   no
7   7   no
8   8   no
9   10  yes
10  2   yes

How can I best accomplish this? I need code that can accommodate many versions of this problem, including successive columns with NA values before the column containing the desired values. If I could just remove each individual cell with NA values, while shifting remaining values left, that would work. Is this possible?

So you want to collapse all the columns with `NA` into one column that has only known values? — Matt, Jun 13 '17 at 17:25
See also [How to implement coalesce efficiently in R](https://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r) and links therein. — Henrik, Jun 13 '17 at 17:35

mtoto · Answer 1 · 2017-06-13T18:13:47.523

2

One option in base R is to get the array indices of the non NA values using which(), then subset the dataset according to the resulting matrix of indices sorted by row number.

indices <- which(!is.na(data[,-1]), arr.ind = TRUE)
data$X <- data[,-1][indices[order(indices[,1]),]]

edited Jun 13 '17 at 18:13

answered Jun 13 '17 at 17:26

mtoto

23,919
4
58
71

score 2 · Answer 2 · answered Jun 13 '17 at 17:59

Using matrix subsetting in base R, we can select the non missing values as follows. The outer cbind constructs the two column data.frame. The second column is constructed by matrix subsetting. A matrix is fed to data which identify the desired elements. Here, the rows are selected with seq_len and the columns are selected using max.col to find the column with TRUE for each row. That is, the column that does not have NA in data[-1]. A 1 is added to adjust for the initial missing column.

cbind(data[1L], response=data[cbind(seq_len(nrow(data)), max.col(!is.na(data[-1L])) + 1L)])
    W response
1  10      yes
2   7       no
3   8       no
4   5      yes
5   1       no
6   2       no
7   6       no
8   4       no
9   3      yes
10  9      yes

score 0 · Answer 3 · answered Jun 13 '17 at 17:27

0

Using coalesce from dplyr,

Reduce(dplyr::coalesce, data[-1])
 [1] "yes" "no"  "no"  "yes" "no"  "no"  "no"  "no"  "yes" "yes"

answered Jun 13 '17 at 17:27

Sotos

51,121
6
32
66

score 0 · Answer 4 · answered Jun 13 '17 at 18:47

Another option is pmax

cbind(data[1], response = do.call(pmax, c(data[-1], na.rm = TRUE)))
#    W response
#1   3      yes
#2   6       no
#3  10       no
#4   2      yes
#5   5       no
#6   7       no
#7   8       no
#8   1       no
#9   4      yes
#10  9      yes

What is the best way to clean a data.frame in which row values are ordered properly, but are arbitrarily separated by NA columns?

4 Answers4