3

I want to identify rows that contains NA, and is between a zero and a one. Consider this data.table:

DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

# DT
# a
# 1:  0
# 2: NA
# 3: NA
# 4:  0
# 5: NA
# 6:  1
# 7:  1
# 8: NA
# 9:  0
# 10: NA
# 11:  1
# 12: NA
# 13: NA
# 14: NA
# 15:  0
# 16:  1
# 17:  1
# 18:  0
# 19: NA
# 20:  0

How to identify row no. 5, 8, 10 and 12:14?

Jeppe Olsen
  • 968
  • 8
  • 19

3 Answers3

2

You can try using approx

DT[,b := approx((1:.N)[!is.na(a)],na.omit(a),1:.N)$y]

and then apply

DT[, which(is.na(a) & b>0 & b<1)]

or

DT[, which(is.na(a) & between(b, 0, 1, FALSE))]

which gives

[1]  5  8 10 12 13 14
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • 1
    Thanks for that. Guess you could make it a bit more simple to omit is.na(DT[,a]) &, so it is only which(DT[,b]>0 & DT[,b]<1), as all that is !is.na(a) is always 1 or 0. – Jeppe Olsen Sep 29 '20 at 08:46
1

The begin of the NA-sequences can calculated so:

library("data.table")
DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

r <- DT[, rle(is.na(a))]
R <- data.table(r$values, r$lengths, start=c(1, 1+head(cumsum(r$lengths), -1)))

i <- R[(V1), start]
j <- R[(V1), start+V2-1]
i[(DT[i-1, a] + DT[j+1, a])==1]
# result: [1]  5  8 10 12
jogo
  • 12,469
  • 11
  • 37
  • 42
1

The zoo package and its na.locf() function could help you as described by Dirk Eddelbuettel in here: Replacing NAs with latest non-NA value.

library(data.table)
library(zoo)

DT <- data.table(a = c(0, NA, NA, 0, NA, 1, 1, NA, 0, NA, 1, NA, NA, NA, 0, 1, 1, 0, NA, 0))

non_nas <- DT[!is.na(a), a]
successor <- c(non_nas[-1], 0)
diff <- abs(non_nas - successor)
DT[!is.na(a), diff:=diff]

This will give you a data table as follows:

     a diff
 1:  0    0
 2: NA   NA
 3: NA   NA
 4:  0    1
 5: NA   NA
 6:  1    0
 7:  1    1
 8: NA   NA
 9:  0    1
10: NA   NA
11:  1    1
12: NA   NA
13: NA   NA
14: NA   NA
15:  0    1
16:  1    0
17:  1    1
18:  0    0
19: NA   NA
20:  0    0

The idea here is that every '1' in the diff column tells you that the value in 'a' is going to change after the NAs below.

Now you want to get rid of the NAs in the 'diff' column. For clarity, we put the result into the new column 'b'. This is where the zoo package comes into play:

DT[, b:=na.locf(diff)]

This results in

     a diff b
 1:  0    0 0
 2: NA   NA 0
 3: NA   NA 0
 4:  0    1 1
 5: NA   NA 1
 6:  1    0 0
 7:  1    1 1
 8: NA   NA 1
 9:  0    1 1
10: NA   NA 1
11:  1    1 1
12: NA   NA 1
13: NA   NA 1
14: NA   NA 1
15:  0    1 1
16:  1    0 0
17:  1    1 1
18:  0    0 0
19: NA   NA 0
20:  0    0 0

Eventually

DT[is.na(a) & b == 1, which = TRUE]

will give you:

[1]  5  8 10 12 13 14
Michael Sebald
  • 196
  • 1
  • 8