1

I asked the same question here but it was closed as my post has been associated with similar questions although they are not related to my question and don't resolve it.

The dataset:

I have a huge data set saved in a matrix where the number of rows is more that one million with a dozen of columns.

The matrix looks like

data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, NA, 3, NA, 5, NA, NA, NA, 8, NA, 5, NA, 7, NA, NA, NA), ncol=3)
> data
     [,1] [,2] [,3]
[1,]    1    1    8
[2,]   NA   NA   NA
[3,]    2    3    5
[4,]   NA   NA   NA
[5,]    1    5    7
[6,]   NA   NA   NA
[7,]   NA   NA   NA
[8,]   NA   NA   NA

So if there is a missing value in certain column, then necessarily other columns will have missing values for the same row.

The question:

I would like to delete "efficiently" consecutive missing values if there are 3 or more in each column for all columns in the matrix. So I would like to delete consecutive na in a column not a row.

I already saw solutions, like this one, for my question but they were too slow for my huge data set. Do you have other suggestions which can achieve the objective efficiently? Additionally, the suggested answers (1 & 2) for my closed question are deleting if the missing values are consecutive in rows not columns.

EDIT:

Following to the comment below, the output must be like this:

         [,1] [,2] [,3]
    [1,]    1    1    8
    [2,]   NA   NA   NA
    [3,]    2    3    5
    [4,]   NA   NA   NA
    [5,]    1    5    7
  

EDIT:

> data
         [,1] [,2] [,3] [,4]
    [1,]    1    1    8    NA
    [2,]   NA   NA   NA    NA
    [3,]    2    3    5    NA
    [4,]   NA   NA   NA    NA
    [5,]    1    5    7    NA
    [6,]   NA   NA   NA    NA
    [7,]   NA   NA   NA    NA
    [8,]   NA   NA   NA    NA

The expected output

         [,1] [,2] [,3]
    [1,]    1    1    8
    [2,]   NA   NA   NA
    [3,]    2    3    5
    [4,]   NA   NA   NA
    [5,]    1    5    7
   
Jean
  • 57
  • 7

2 Answers2

2

If it is consecutive, then may be rle can be used

i1 <- rowSums(is.na(data)) > 0
# // or just forgot to update here
i1 <- is.na(data[,1])

data[!inverse.rle(within.list(rle(i1), {
          values[values & lengths < 3] <- FALSE})),]

-output

#      [,1] [,2] [,3]
#[1,]    1    1    8
#[2,]   NA   NA   NA
#[3,]    2    3    5
#[4,]   NA   NA   NA
#[5,]    1    5    7

Update

If we have a particular column with all NAs, then we can remove it first

data1 <- data[,colSums(!is.na(data)) != 0]

and now we apply the previous code on the selected column data

i1 <- is.na(data1[,1])

data1[!inverse.rle(within.list(rle(i1), {
      values[values & lengths < 3] <- FALSE})),]

Or we may use rleid from data.table (which would be more efficient)

library(data.table)
data[as.data.table(data)[, .I[!(.N >=3 & is.na(V1))], 
             rleid(is.na(V1))]$V1,]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks for your answer. Do you think that this could be the most efficient method for a data set of millions? – Jean Feb 04 '21 at 08:25
  • @Jean I saw your update expected after I posted the answer. Initially, it was looping over the rows with `apply`. Now, with your condition that a row can be all NA if any of the row is NA, then either select the first and check for NA with `is.na` or use `rowSums` on is.na. apply the `rle` on the logical vector to subset the data based on the `lengths` condition – akrun Feb 04 '21 at 08:29
  • Thanks for your updates. What if there is only one column which contains consecutive missing values and other columns not? – Jean Feb 04 '21 at 09:22
  • @Jean In that case, what is your expected output? i.e. are you looking for an `OR` condition or `AND` here i.e. all the columns should have consecutive NAs, then remove it? – akrun Feb 04 '21 at 09:22
  • Yes you are right, effectivly, all columns must contain NAs in the same rows. On the other hand, I am trying to modify your script to include a more general case where there is an additional column contains consecutive NAs regardless the values in other columns. I will add an update for this general case. – Jean Feb 04 '21 at 09:28
2

if there is a missing value in certain column, then necessarily other columns will have missing values for the same row.

I think this is very important information, we can take advantage of it and work only with any 1 column instead of complete dataset. Try :

vec <- data[, 1]
data[!with(rle(is.na(vec)), rep(values & lengths >= 3, lengths)), ]

#     [,1] [,2] [,3]
#[1,]    1    1    8
#[2,]   NA   NA   NA
#[3,]    2    3    5
#[4,]   NA   NA   NA
#[5,]    1    5    7
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213