1

I have an unusual data set that I need to work with and I've created a small scale, reproducible example.

library(data.table)
DT <- data.table(Type = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)), Cohort = c(NA,1:4, NA, 5:7, NA, 8:12))
dt <- data.table(Type = c(rep("A", 4), rep("B", 3), rep("C", 5)), Cohort = 1:12)

I need DT to look like dt and the actual dataset has 6.8 million rows. I realize it might be a simple issue but I can't seem to figure it out, maybe setkey? Any help is appreciated, thanks.

abbas786
  • 401
  • 3
  • 11

2 Answers2

5

You can replace "" by NA and use na.locf from the zoo package:

library(zoo)
DT[Type=="",Type:=NA][,Type:=na.locf(Type)][!is.na(Cohort)]
Marat Talipov
  • 13,064
  • 5
  • 34
  • 53
  • Also, check out http://stackoverflow.com/questions/24255200/replace-na-with-last-non-na-in-data-table-by-using-only-data-table if you need to get into `Rcpp` for better performance – Marat Talipov Jun 03 '16 at 23:15
0

Here is another option without using na.locf. Grouped by the cumulative sum of logical vector (Type!=""), we select the first 'Type' and the lead value of 'Cohort', assign (:=) it to the names of 'DT' to replace the original column values and use na.omit to replace the NA rows.

na.omit(DT[, names(DT) :=  .(Type[1L], shift(Cohort, type="lead")), cumsum(Type!="")])
#     Type Cohort
# 1:    A      1
# 2:    A      2
# 3:    A      3
# 4:    A      4
# 5:    B      5
# 6:    B      6
# 7:    B      7
# 8:    C      8
# 9:    C      9
#10:    C     10
#11:    C     11
#12:    C     12
akrun
  • 874,273
  • 37
  • 540
  • 662