How to find (not replace) leading NAs, gaps, and final NAs in data.table columns by category

Question

I'm trying to get an idea about the type of missings in my panel dataset. I think there can be three cases:

leading NA's; before data starts for a certain individual
gaps; so missing data for a couple of time periods after which data restarts
NA's at the end; of the data if an individual stops early

I'm not looking for functions that directly change them or fill them in. Instead, I want to decide what to do with them after I have an idea of the problem.

How to get rid of leading NA's (but not how to see how many you have) is solved here. Addressing all NA's is straightforward:

library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA]
# in Petal lenth setosa has lead NA's, versicolor a gap, virginica NA's at the end

Data[is.na(Petal.Length)] # this is a mix of all three types of NA's

But I want to differentiate the three cases. Ideally, I'd like to address them directly in data.table as

"give me a data table with all observations that have leading NAs in Petal.Length"
"give me a data table with observations that are gaps in Petal.Length"
"give me a data table with observations that are NA's during the last time periods per individual"

For lead NA's I can still get it done but it feels super clumsy:

Data[!is.na(Petal.Length), firstobs := ifelse(min(time) == time, 1, 0), by = Species]
Data[, mintime := max(firstobs * time, na.rm = T), by = Species]
Data[time < mintime]

I guess something similar could be done with max and leads for the last NA's but I cant get my head around gaps and those are the most important ones for me. The solutions I found online usually directly fill in, delete or shift these NA's, I just want to have a look.

Desired output would be:

leading NAs:

Data[1:5]

gaps:

Data[60:65]

NA's at the end:

Data[145:150]

But I'd like to get these by checking where the three types of NA's are as my actual dataset is to large to check this manually.

edit: I should add that in my real dataset, I don't know when every individual starts reporting data. So:

Data[is.na(Petal.Length), time, by= Species]

will not help me.

`Data[, rle(is.na(Petal.Length)), by= Species]` could be what you need? — dww, Mar 22 '18 at 12:02
@dww that gives me the number of NAs, very useful, thanks! I didn't know about rle(). But its not what I'm looking for atm, I want to distinguish the three types of NAs — Jakob, Mar 22 '18 at 12:52
Pretty sure it does that - the NA's tha come before any non-NA's are clearly at the start. And so on. You just need to figure out how to interpret it. — dww, Mar 22 '18 at 14:49
you're right, thanks! Its a nice direct way to see the individuals with gaps etc — Jakob, Mar 22 '18 at 14:58

score 1 · Answer 1 · answered Mar 22 '18 at 11:43

It seems as though you could define a function as a helper here:

my.fun <- function(spe){
  k1 <- intersect(which(dat$Species == spe), which(is.na(dat$Petal.Length)))
  k2 <- intersect(which(dat$Species == spe), which(!is.na(dat$Petal.Length)))
  mintime <- min(dat$time[k2])
  mintime.na <- min(dat$time[k1])
  c <- (length(k1) > 0) && (mintime.na <= mintime)
  if(c){
    x <- cbind(dat[k1,], mintime)
    return(x)
    }
}
species.list <- as.character(unique(Data$Species))
sapply(species.list, my.fun)
# returns
$setosa
   Species Petal.Length time mintime
1:  setosa           NA 1951    1956
2:  setosa           NA 1952    1956
3:  setosa           NA 1953    1956
4:  setosa           NA 1954    1956
5:  setosa           NA 1955    1956

$versicolor
NULL

$virginica
NULL

It is still unclear to me though what exactly you mean by "with all the gaps in Petal.Length".

thanks! I mean the cases where I have data, then no data, then again data within one species. This is the case for versicolor in my example — Jakob, Mar 22 '18 at 12:53

Frank · Accepted Answer · 2018-03-22T15:35:49.610

One way:

Data[, g := {
  r = rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r = replace(r, r == 1L, 0L)
  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by=Species]

Confirming that it matches the rows expected by the OP...

> # leading
> Data[g == 0L, which = TRUE]
[1] 1 2 3 4 5
> # trailing
> Data[g == 9999L, which = TRUE]
[1] 145 146 147 148 149 150
> # gaps
> Data[!.(c(0L, 9999L, NA_integer_)), on="g", which = TRUE]
[1] 60 61 62 63 64 65

To just take the subset, use these commands without the which = TRUE argument.

Beyond just identifying the rows in each of the three categories, this approach also identifies gaps via distinct g values if there are multiple.

How it works

You can insert some print and cat instructions to follow what each object looks like during the loop:

csprintf <- function(s, ...) cat(sprintf(s, ...))
Data[, g := {
  csprintf("Group: %s = %s %s\n", toString(names(.BY)), toString(.BY), strrep("*", 60))

  r = rleid(vna <- is.na(Petal.Length))
  csprintf("NA positions and initial grouping vector:\n")
  print(data.table(Petal.Length, r, vna))

  if (first(vna)) r = replace(r, r == 1L, 0L)
  csprintf("NA positions and grouping vector after tagging leading NAs:\n")
  print(data.table(Petal.Length, r, vna))

  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  csprintf("NA positions and grouping vector after tagging trailing NAs:\n")
  print(data.table(Petal.Length, r, vna))

  r = replace(r, !vna, NA_integer_)
  csprintf("NA positions and grouping vector after tagging non-NAs:\n")
  print(data.table(Petal.Length, r, vna))

  cat(strrep("\n", 2))

  r
}, by=Species]

Pretty much, it creates the vna vector indicating NA positions and the r vector that groups runs in vna. Then it assigns special codes to some certain runs that can later be used for filtering.

that works great but I don't really understand it... Could you maybe provide some hints about what is happening? So you create vna and directly condition on it? And what is NA_integer_? — Jakob, Mar 22 '18 at 15:03
@PeterPan NA_integer_ is just an integer NA. I have some notes on it in the box here http://franknarf1.github.io/r-tutorial/_book/basics.html#na or you could read `?NA` I've added a brief explanation and some suggestions for tracking what's going on. I guess instead of inserting cat/print statements, you could use `browser()`, but I am not very familiar with that approach. — Frank, Mar 22 '18 at 15:38

score 1 · Answer 3 · answered Mar 23 '23 at 05:04

Here is one option to do it with dplyr. Note: this solution does not work if your data has zero values

Inspired by: How delete leading and trailing rows by condition in R? and How to delete rows for leading and trailing NAs by group in R

library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA] 


data_NAident <- Data %>% 
  # cumsum does not handle with NAs so need to replace them with 0 (this solution thus does not work if data has zero values)
  mutate(valueTEMP = ifelse(is.na(Petal.Length), 0, Petal.Length )) %>% 
  mutate(isLeadingNA = cumsum(valueTEMP) == 0,
         isTrailingNA = rev(cumsum(rev(valueTEMP))) ==0,
         isGapNA = valueTEMP == 0 & !isLeadingNA & !isTrailingNA) %>% 
  select(-valueTEMP)

dataLeadingNAs <- data_NAident %>% 
  filter(isLeadingNA)

dataTrailingNAs <- data_NAident %>% 
  filter(isTrailingNA)

dataGapNAs <- data_NAident %>% 
  filter(isGapNA)

How to find (not replace) leading NAs, gaps, and final NAs in data.table columns by category

3 Answers3

Linked