3

I'm trying to get an idea about the type of missings in my panel dataset. I think there can be three cases:

  1. leading NA's; before data starts for a certain individual
  2. gaps; so missing data for a couple of time periods after which data restarts
  3. NA's at the end; of the data if an individual stops early

I'm not looking for functions that directly change them or fill them in. Instead, I want to decide what to do with them after I have an idea of the problem.

How to get rid of leading NA's (but not how to see how many you have) is solved here. Addressing all NA's is straightforward:

library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA]
# in Petal lenth setosa has lead NA's, versicolor a gap, virginica NA's at the end

Data[is.na(Petal.Length)] # this is a mix of all three types of NA's 

But I want to differentiate the three cases. Ideally, I'd like to address them directly in data.table as

  1. "give me a data table with all observations that have leading NAs in Petal.Length"
  2. "give me a data table with observations that are gaps in Petal.Length"
  3. "give me a data table with observations that are NA's during the last time periods per individual"

For lead NA's I can still get it done but it feels super clumsy:

Data[!is.na(Petal.Length), firstobs := ifelse(min(time) == time, 1, 0), by = Species]
Data[, mintime := max(firstobs * time, na.rm = T), by = Species]
Data[time < mintime]

I guess something similar could be done with max and leads for the last NA's but I cant get my head around gaps and those are the most important ones for me. The solutions I found online usually directly fill in, delete or shift these NA's, I just want to have a look.

Desired output would be:

leading NAs:

Data[1:5]

gaps:

Data[60:65]

NA's at the end:

Data[145:150]

But I'd like to get these by checking where the three types of NA's are as my actual dataset is to large to check this manually.

edit: I should add that in my real dataset, I don't know when every individual starts reporting data. So:

Data[is.na(Petal.Length), time, by= Species]

will not help me.

Jakob
  • 1,325
  • 15
  • 31
  • Could you provide a sample of the expected output. – niko Mar 22 '18 at 11:26
  • 1
    `Data[, rle(is.na(Petal.Length)), by= Species]` could be what you need? – dww Mar 22 '18 at 12:02
  • @dww that gives me the number of NAs, very useful, thanks! I didn't know about rle(). But its not what I'm looking for atm, I want to distinguish the three types of NAs – Jakob Mar 22 '18 at 12:52
  • 1
    Pretty sure it does that - the NA's tha come before any non-NA's are clearly at the start. And so on. You just need to figure out how to interpret it. – dww Mar 22 '18 at 14:49
  • you're right, thanks! Its a nice direct way to see the individuals with gaps etc – Jakob Mar 22 '18 at 14:58

3 Answers3

1

It seems as though you could define a function as a helper here:

my.fun <- function(spe){
  k1 <- intersect(which(dat$Species == spe), which(is.na(dat$Petal.Length)))
  k2 <- intersect(which(dat$Species == spe), which(!is.na(dat$Petal.Length)))
  mintime <- min(dat$time[k2])
  mintime.na <- min(dat$time[k1])
  c <- (length(k1) > 0) && (mintime.na <= mintime)
  if(c){
    x <- cbind(dat[k1,], mintime)
    return(x)
    }
}
species.list <- as.character(unique(Data$Species))
sapply(species.list, my.fun)
# returns
$setosa
   Species Petal.Length time mintime
1:  setosa           NA 1951    1956
2:  setosa           NA 1952    1956
3:  setosa           NA 1953    1956
4:  setosa           NA 1954    1956
5:  setosa           NA 1955    1956

$versicolor
NULL

$virginica
NULL

It is still unclear to me though what exactly you mean by "with all the gaps in Petal.Length".

niko
  • 5,253
  • 1
  • 12
  • 32
  • thanks! I mean the cases where I have data, then no data, then again data within one species. This is the case for versicolor in my example – Jakob Mar 22 '18 at 12:53
  • I added the kind of output I'm looking for above – Jakob Mar 22 '18 at 12:55
1

One way:

Data[, g := {
  r = rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r = replace(r, r == 1L, 0L)
  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by=Species]

Confirming that it matches the rows expected by the OP...

> # leading
> Data[g == 0L, which = TRUE]
[1] 1 2 3 4 5
> # trailing
> Data[g == 9999L, which = TRUE]
[1] 145 146 147 148 149 150
> # gaps
> Data[!.(c(0L, 9999L, NA_integer_)), on="g", which = TRUE]
[1] 60 61 62 63 64 65

To just take the subset, use these commands without the which = TRUE argument.

Beyond just identifying the rows in each of the three categories, this approach also identifies gaps via distinct g values if there are multiple.


How it works

You can insert some print and cat instructions to follow what each object looks like during the loop:

csprintf <- function(s, ...) cat(sprintf(s, ...))
Data[, g := {
  csprintf("Group: %s = %s %s\n", toString(names(.BY)), toString(.BY), strrep("*", 60))

  r = rleid(vna <- is.na(Petal.Length))
  csprintf("NA positions and initial grouping vector:\n")
  print(data.table(Petal.Length, r, vna))

  if (first(vna)) r = replace(r, r == 1L, 0L)
  csprintf("NA positions and grouping vector after tagging leading NAs:\n")
  print(data.table(Petal.Length, r, vna))

  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  csprintf("NA positions and grouping vector after tagging trailing NAs:\n")
  print(data.table(Petal.Length, r, vna))

  r = replace(r, !vna, NA_integer_)
  csprintf("NA positions and grouping vector after tagging non-NAs:\n")
  print(data.table(Petal.Length, r, vna))

  cat(strrep("\n", 2))

  r
}, by=Species]

Pretty much, it creates the vna vector indicating NA positions and the r vector that groups runs in vna. Then it assigns special codes to some certain runs that can later be used for filtering.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • that works great but I don't really understand it... Could you maybe provide some hints about what is happening? So you create vna and directly condition on it? And what is NA_integer_? – Jakob Mar 22 '18 at 15:03
  • @PeterPan NA_integer_ is just an integer NA. I have some notes on it in the box here http://franknarf1.github.io/r-tutorial/_book/basics.html#na or you could read `?NA` I've added a brief explanation and some suggestions for tracking what's going on. I guess instead of inserting cat/print statements, you could use `browser()`, but I am not very familiar with that approach. – Frank Mar 22 '18 at 15:38
1

Here is one option to do it with dplyr. Note: this solution does not work if your data has zero values

Inspired by: How delete leading and trailing rows by condition in R? and How to delete rows for leading and trailing NAs by group in R

library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA] 


data_NAident <- Data %>% 
  # cumsum does not handle with NAs so need to replace them with 0 (this solution thus does not work if data has zero values)
  mutate(valueTEMP = ifelse(is.na(Petal.Length), 0, Petal.Length )) %>% 
  mutate(isLeadingNA = cumsum(valueTEMP) == 0,
         isTrailingNA = rev(cumsum(rev(valueTEMP))) ==0,
         isGapNA = valueTEMP == 0 & !isLeadingNA & !isTrailingNA) %>% 
  select(-valueTEMP)

dataLeadingNAs <- data_NAident %>% 
  filter(isLeadingNA)

dataTrailingNAs <- data_NAident %>% 
  filter(isTrailingNA)

dataGapNAs <- data_NAident %>% 
  filter(isGapNA)
MattiKummu
  • 21
  • 2