1

I have a dataset that looks like this (mock example):

SW_I    fault X     locomotive A        10 faults     100 days
SW_I    fault X     locomotive B        20 faults     200 days
SW_I    fault X     locomotive C        30 faults     300 days

SW_I    fault Y     locomotive D        90 faults     100 days
SW_I    fault Y     locomotive E        10 faults     100 days

I need the “censored” data to be imputed to look like this:

SW_I    fault X     locomotive A        10 faults     100 days
SW_I    fault X     locomotive B        20 faults     200 days
SW_I    fault X     locomotive C        30 faults     300 days
SW_I    fault X     locomotive D         0 faults     100 days
SW_I    fault X     locomotive E         0 faults     100 days

SW_I    fault Y    locomotive A        0 faults     100 days
SW_I    fault Y    locomotive B        0 faults     200 days
SW_I    fault Y    locomotive C        0 faults     300 days
SW_I    fault Y    locomotive D       90 faults     100 days
SW_I    fault Y    locomotive E       10 faults     100 days

What is the best way to do this with data.tables (the dataset I have is large)? I can a list of unique locomotives for each SW_n, then, subset by fault, and concatenate new lines with all the locomotives not included in the resulting subsetted table, with number of faults = 0, but number of days being the same.

I wonder, however, if there is a cleverer way to do this, with some sort of a merge of two copies of the same table, one with the actual number of faults, and another with zeroes.

P.S. I am not trying to impute missing data. I am trying to definitively show that the censored data are zeroes.

Anarcho-Chossid
  • 2,210
  • 4
  • 27
  • 44
  • So by 'impute' you mean find out which observations aren't in the data at all, and set them at 0 faults and 100 days? – Heroka Oct 28 '15 at 16:02
  • Essentially. I don’t need to “find out” from an outside source: I can assume I have the total population of locomotives in my dataset; so if there is a locomotive that is listed to have a fault A under a specific software version but not fault B, we can assume that number of faults = 0 for fault B, that software version, that locomotive. – Anarcho-Chossid Oct 28 '15 at 16:05
  • 1
    This example might be of use: `DT = data.table(a=c(1,1,2,2), cat=c("A","B","A","D"), x = 1:4); DT[CJ(a = a, cat = cat, unique=TRUE), on=c("a","cat")]`. (Your example is hard to reproduce by copy-pasting, and I'm not clear on how the non-locomotive cols are being filled in.) – Frank Oct 28 '15 at 17:16
  • @Frank Error in `[.data.table`(DT, CJ(a = a, cat = cat, unique = TRUE), on = c("a", : unused argument (on = c("a", "cat")) – Anarcho-Chossid Oct 28 '15 at 17:23
  • Your data.table version is old, I guess. `on=` has been available since version 1.9.6, currently on CRAN. – Frank Oct 28 '15 at 17:23
  • The non-locomotive columns are being filled in by assuming zeroes wherever a locomotive instance was listed under a particular software but not a fault code. For that fault code, the count becomes 0. – Anarcho-Chossid Oct 28 '15 at 17:24
  • 1
    Possible duplicate of [how to insert missing observations on a data frame](http://stackoverflow.com/questions/33003819/how-to-insert-missing-observations-on-a-data-frame) – alexwhitworth Oct 28 '15 at 17:51
  • 1
    possible [duplicate2](http://stackoverflow.com/questions/32979865/) – alexwhitworth Oct 28 '15 at 17:52
  • @Frank: thanks a lot! I wrote a function and a for loop and looped through SW_N column. – Anarcho-Chossid Oct 28 '15 at 21:46

0 Answers0