0

Occasionally, I need to clean very messy datasets, which are the result of importing a pdf table to a spreadsheet. When the pdf file is converted, all of the columns remain in the correct order (in relation to each other), but blank columns are scattered arbitrarily between them.

Here is a greatly simplified example.

data <- data.frame(
  W = sample(1:10),
  X = c("yes","no"," ","yes","no"," "," ","no","yes"," "),
  Y = c(" "," "," "," "," ","no","no"," "," ","yes"),
  Z = c(" "," ","no"," "," "," "," "," "," "," ")
)
data$X <- gsub(" ", NA, data$X)
data$Y <- gsub(" ", NA, data$Y)
data$Z <- gsub(" ", NA, data$Z)

This results in:

> data
    W    X    Y    Z
1   6  yes <NA> <NA>
2   4   no <NA> <NA>
3   3 <NA> <NA>   no
4   5  yes <NA> <NA>
5   9   no <NA> <NA>
6   1 <NA>   no <NA>
7   7 <NA>   no <NA>
8   8   no <NA> <NA>
9  10  yes <NA> <NA>
10  2 <NA>  yes <NA>

I want to get this:

    W    X
1   6   yes
2   4   no
3   3   no
4   5   yes
5   9   no
6   1   no
7   7   no
8   8   no
9   10  yes
10  2   yes

How can I best accomplish this? I need code that can accommodate many versions of this problem, including successive columns with NA values before the column containing the desired values. If I could just remove each individual cell with NA values, while shifting remaining values left, that would work. Is this possible?

John J.
  • 1,450
  • 1
  • 13
  • 28
  • So you want to collapse all the columns with `NA` into one column that has only known values? – Matt Jun 13 '17 at 17:25
  • See also [How to implement coalesce efficiently in R](https://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r) and links therein. – Henrik Jun 13 '17 at 17:35

4 Answers4

2

One option in base R is to get the array indices of the non NA values using which(), then subset the dataset according to the resulting matrix of indices sorted by row number.

indices <- which(!is.na(data[,-1]), arr.ind = TRUE)
data$X <- data[,-1][indices[order(indices[,1]),]]
mtoto
  • 23,919
  • 4
  • 58
  • 71
2

Using matrix subsetting in base R, we can select the non missing values as follows. The outer cbind constructs the two column data.frame. The second column is constructed by matrix subsetting. A matrix is fed to data which identify the desired elements. Here, the rows are selected with seq_len and the columns are selected using max.col to find the column with TRUE for each row. That is, the column that does not have NA in data[-1]. A 1 is added to adjust for the initial missing column.

cbind(data[1L], response=data[cbind(seq_len(nrow(data)), max.col(!is.na(data[-1L])) + 1L)])
    W response
1  10      yes
2   7       no
3   8       no
4   5      yes
5   1       no
6   2       no
7   6       no
8   4       no
9   3      yes
10  9      yes
lmo
  • 37,904
  • 9
  • 56
  • 69
0

Using coalesce from dplyr,

Reduce(dplyr::coalesce, data[-1])
 [1] "yes" "no"  "no"  "yes" "no"  "no"  "no"  "no"  "yes" "yes"
Sotos
  • 51,121
  • 6
  • 32
  • 66
0

Another option is pmax

cbind(data[1], response = do.call(pmax, c(data[-1], na.rm = TRUE)))
#    W response
#1   3      yes
#2   6       no
#3  10       no
#4   2      yes
#5   5       no
#6   7       no
#7   8       no
#8   1       no
#9   4      yes
#10  9      yes
akrun
  • 874,273
  • 37
  • 540
  • 662