0

I have the following data.table, with a column of NA values and non-NA values of type character

library(data.table)
dt = fread(...)

print(dt$column1)

   [1]  NA     NA     NA     "1 1"  "1 1"  "1 1"  NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [11] NA     "1 2"  NA     NA     NA     NA     NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [21] NA     NA     NA     NA     NA     NA     NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [31] NA     NA     NA     NA     NA     "1 3"  NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [41] NA     "1 4"  "1 4"  NA     NA     NA     NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [51] NA     NA     NA     NA     NA     NA     NA     NA     NA     NA                                                                                                                                                                                                                                                             
   [61] NA     NA     "1 5"  NA     NA     NA     NA     NA     NA     NA                                                                                                                                                                                                                                                             
   ...

I would like a new column which denotes the labels of consecutive non-NA values, i.e.

print(dt$groups)

   [1]  0     0      0      1      1      1      0      0      0      0                                                                                                                                                                                                                                                             
   [11] 0     2      0      0      0      0      0      0      0      0                                                                                                                                                                                                                                                             
   [21] 0     0      0      0      0      0      0      0      0      0                                                                                                                                                                                                                                                            
   [31] 0     0      0      0      0      3      0      0      0      0                                                                                                                                                                                                                                                             
   [41] 0     4      4      0      0      0      0      0      0      0                                                                                                                                                                                                                                                           
   [51] 0     0      0      0      0      0      0      0      0      0                                                                                                                                                                                                                                                            
   [61] 0     0      5      0      0      0      0      0      0      0                                                                                                                                                                                                                                                               
   ...

If I try this:

dt[, groups := !is.na(column1)]

This will give me a boolean vector, with consecutive TRUE statements. I am not sure however how to translate this into labels for consecutive pairs of TRUE.

is there a data.table way to do this?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234
  • Probably `dt[, groups := !is.na(column1) * cumsum(!is.na(column1))]` will work, but it is difficult to test code without a [working example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – lmo May 08 '17 at 17:06
  • @lmo, nice try. But this will assign different group id to the first 3 consecutive non-NA values. and you missed a pair of parentheses: `(!is.na(column1)) * cumsum(!is.na(column1))` – mt1022 May 08 '17 at 17:17
  • 1
    @mt1022 Thanks. Like I said, it's hard to test code without a working example. – lmo May 08 '17 at 17:24

1 Answers1

3

Here is a solution with rle:

library(data.table)
set.seed(1)
dt <- data.table(column1 = sample(c(rep(NA, 3), '1'), 30, replace = T))

x <- rle(dt$column1)
y <- cumsum(!is.na(x$values))
y[duplicated(y)] <- 0
x$values <- y
set(dt, NULL, 'group', inverse.rle(x))

# > dt
#     column1 group
# 1:      NA     0
# 2:      NA     0
# 3:      NA     0
# 4:       1     1
# 5:      NA     0
# 6:       1     2
# 7:       1     2
# 8:      NA     0
# 9:      NA     0
# 10:      NA     0
# 11:      NA     0
# 12:      NA     0
# 13:      NA     0
# 14:      NA     0
# 15:       1     3
# 16:      NA     0
# 17:      NA     0
# 18:       1     4
# 19:      NA     0
# 20:       1     5
# 21:       1     5
# 22:      NA     0
# 23:      NA     0
# 24:      NA     0
# 25:      NA     0
# 26:      NA     0
# 27:      NA     0
# 28:      NA     0
# 29:       1     6
# 30:      NA     0
# column1 group

Incorporating lmo's comment, the could be simpilied to:

x <- rle(dt$column1)
x$values <- (!is.na(x$values)) * cumsum(!is.na(x$values))

set(dt, NULL, 'group', inverse.rle(x))
mt1022
  • 16,834
  • 5
  • 48
  • 71
  • Thanks for the help. I'm not following your variable names though: What is `x$values` in the first code snippet? I get the error `Error in x$values : $ operator is invalid for atomic vectors ` – ShanZhengYang May 08 '17 at 17:56
  • @ShanZhengYang, `x <- rle(dt$column1)` will create a list with `rle` class. It has two elements: `lengths` and `values`. – mt1022 May 08 '17 at 18:02