Find rows with NA between 0s and 1s

Question

I want to identify rows that contains NA, and is between a zero and a one. Consider this data.table:

DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

# DT
# a
# 1:  0
# 2: NA
# 3: NA
# 4:  0
# 5: NA
# 6:  1
# 7:  1
# 8: NA
# 9:  0
# 10: NA
# 11:  1
# 12: NA
# 13: NA
# 14: NA
# 15:  0
# 16:  1
# 17:  1
# 18:  0
# 19: NA
# 20:  0

How to identify row no. 5, 8, 10 and 12:14?

ThomasIsCoding · Accepted Answer · 2020-09-29T12:34:46.360

2

You can try using approx

DT[,b := approx((1:.N)[!is.na(a)],na.omit(a),1:.N)$y]

and then apply

DT[, which(is.na(a) & b>0 & b<1)]

or

DT[, which(is.na(a) & between(b, 0, 1, FALSE))]

which gives

[1]  5  8 10 12 13 14

edited Sep 29 '20 at 12:34

answered Sep 29 '20 at 08:05

ThomasIsCoding

96,636
9
24
81

1

Thanks for that. Guess you could make it a bit more simple to omit is.na(DT[,a]) &, so it is only which(DT[,b]>0 & DT[,b]<1), as all that is !is.na(a) is always 1 or 0. – Jeppe Olsen Sep 29 '20 at 08:46

score 1 · Answer 2 · answered Sep 29 '20 at 07:22

The begin of the NA-sequences can calculated so:

library("data.table")
DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

r <- DT[, rle(is.na(a))]
R <- data.table(r$values, r$lengths, start=c(1, 1+head(cumsum(r$lengths), -1)))

i <- R[(V1), start]
j <- R[(V1), start+V2-1]
i[(DT[i-1, a] + DT[j+1, a])==1]
# result: [1]  5  8 10 12

score 1 · Answer 3 · answered Sep 29 '20 at 11:07

The zoo package and its na.locf() function could help you as described by Dirk Eddelbuettel in here: Replacing NAs with latest non-NA value.

library(data.table)
library(zoo)

DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

non_nas <- DT[!is.na(a), a]
successor <- c(non_nas[-1], 0)
diff <- abs(non_nas - successor)
DT[!is.na(a), diff:=diff]

This will give you a data table as follows:

     a diff
 1:  0    0
 2: NA   NA
 3: NA   NA
 4:  0    1
 5: NA   NA
 6:  1    0
 7:  1    1
 8: NA   NA
 9:  0    1
10: NA   NA
11:  1    1
12: NA   NA
13: NA   NA
14: NA   NA
15:  0    1
16:  1    0
17:  1    1
18:  0    0
19: NA   NA
20:  0    0

The idea here is that every '1' in the diff column tells you that the value in 'a' is going to change after the NAs below.

Now you want to get rid of the NAs in the 'diff' column. For clarity, we put the result into the new column 'b'. This is where the zoo package comes into play:

DT[, b:=na.locf(diff)]

This results in

     a diff b
 1:  0    0 0
 2: NA   NA 0
 3: NA   NA 0
 4:  0    1 1
 5: NA   NA 1
 6:  1    0 0
 7:  1    1 1
 8: NA   NA 1
 9:  0    1 1
10: NA   NA 1
11:  1    1 1
12: NA   NA 1
13: NA   NA 1
14: NA   NA 1
15:  0    1 1
16:  1    0 0
17:  1    1 1
18:  0    0 0
19: NA   NA 0
20:  0    0 0

Eventually

DT[is.na(a) & b == 1, which = TRUE]

will give you:

[1]  5  8 10 12 13 14

Find rows with NA between 0s and 1s

3 Answers3