3

I have a data.table with ~18^6 rows and I need to take the unique values of CLASS, by ID, and set their respective columns to 1, as seen below in a baby example

DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"),
                             a=c(0,0,0,0,0),
                             b=c(0,0,0,0,0),
                             c=c(0,0,0,0,0))

### Start with this
ID CLASS a b c
1      a 0 0 0
1      a 0 0 0
1      b 0 0 0
2      c 0 0 0
2      b 0 0 0

### Want this
ID CLASS a b c
1      a 1 1 0
1      a 1 1 0
1      b 1 1 0
2      c 0 1 1
2      b 0 1 1

My first instinct was to try the code below but found that it will set all columns to 1 because unique(DT$CLASS) inherently includes all unique values for all IDs and is not passed through the "grouping" argument parameter so to say.

### Tried this
DT[,unique(DT$CLASS):=1,by=ID]

### Got this
ID CLASS a b c
1      a 1 1 1
1      a 1 1 1
1      b 1 1 1
2      c 1 1 1
2      b 1 1 1

I've been struggling with utilizing data.table to its full potential and speed and would like to create the desired output only using commands within data.table arguments.

Can someone help me write the proper code, using only data.table commands/arguments,so that my jth index includes only the unique values, by ID, and sets the appropriate columns to 1?

Follow-up Question:

Say that each row also has an associated date to it, RXDATE, and I'd want to create respective column names of all class values that holds the minimum RXDATE by CLASS by ID. Could I also turn to dcast for this?

### Start with this
ID CLASS a b c RXDATE
1      a 1 1 0 1-1-99
1      a 1 1 0 1-2-99
1      b 1 1 0 1-3-99
2      c 0 1 1 5-4-00
2      b 0 1 1 6-5-01

### Want this
ID CLASS a b c RXDATE   a_DT   b_DT   c_DT
1      a 1 1 0 1-1-99 1-1-99 1-3-99     NA
1      a 1 1 0 1-2-99 1-1-99 1-3-99     NA
1      b 1 1 0 1-3-99 1-1-99 1-3-99     NA
2      c 0 1 1 5-4-00     NA 6-5-01 5-4-00
2      b 0 1 1 6-5-01     NA 6-5-01 5-4-00
theneil
  • 488
  • 1
  • 4
  • 14
  • 1
    Unorthodox but you can try: `invisible(DT[, set(DT, .I, unique(CLASS), value=1L), by=ID]); DT` To find a better solution, can you share `DT[, .(nID=uniqueN(ID), nCLASS=uniqueN(CLASS)]` of your actual dataset? – chinsoon12 Aug 29 '19 at 00:44

3 Answers3

2

Using dcast and merge you could also do:

DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"),
                             a=c(0,0,0,0,0),
                             b=c(0,0,0,0,0),
                             c=c(0,0,0,0,0))

# dcast to convert to wide
DT_dcast <- dcast(DT[, .(ID, CLASS)], ID ~ CLASS, fun.aggregate = function(x) length(unique(x)), value.var = "CLASS")
DT_dcast
   ID a b c
1:  1 1 1 0
2:  2 0 1 1

# Then merge with the original data.table
DT_m <- merge(DT[, .(ID, CLASS)], DT_dcast, by = "ID")
DT_m
   ID CLASS a b c
1:  1     a 1 1 0
2:  1     a 1 1 0
3:  1     b 1 1 0
4:  2     c 0 1 1
5:  2     b 0 1 1

EDIT You could still use the same approach with dcast and merge.

I noticed from your 'start with this' data has different RX date for row 2 and from 'want this' data you only kept '1-1-99' for this.

DT2 <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"),
                             a=c(0,0,0,0,0),
                             b=c(0,0,0,0,0),
                             c=c(0,0,0,0,0), 
                             RXDate = c("1-1-99", "1-2-99", "1-3-99", "5-4-00", "6-5-01"))

# 2nd row from the data provided has different RXDate under same ID and Class.
# Use x[1] to pick first
DT_dcast <- dcast(DT2[, .(ID, CLASS, RXDate)], ID ~ CLASS, 
                  fun.aggregate = function(x) x[1], 
                  value.var = c("CLASS", "RXDate"))
DT_dcast
   ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1:  1         a         b      <NA>   1-1-99   1-3-99     <NA>
2:  2      <NA>         b         c     <NA>   6-5-01   5-4-00

# Convert 1 or 0 under CLASS
class_cols <- names(DT_dcast)[grepl("CLASS", names(DT_dcast))]
for (col in class_cols) set(DT_dcast, j = col, value = ifelse(is.na(DT_dcast[[col]]), 0, 1))

DT_dcast
ID CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1:  1         1         1         0   1-1-99   1-3-99     <NA>
2:  2         0         1         1     <NA>   6-5-01   5-4-00

# Then merge with the original data.table
DT_m <- merge(DT2[, .(ID, CLASS, RXDate)], DT_dcast, by = "ID")
DT_m

   ID CLASS RXDate CLASS.1_a CLASS.1_b CLASS.1_c RXDate_a RXDate_b RXDate_c
1:  1     a 1-1-99         1         1         0   1-1-99   1-3-99     <NA>
2:  1     a 1-2-99         1         1         0   1-1-99   1-3-99     <NA>
3:  1     b 1-3-99         1         1         0   1-1-99   1-3-99     <NA>
4:  2     c 5-4-00         0         1         1     <NA>   6-5-01   5-4-00
5:  2     b 6-5-01         0         1         1     <NA>   6-5-01   5-4-00

If you want to rename the columns then you can do it using setnames

MKa
  • 2,248
  • 16
  • 22
  • Hello @MKa, this was really useful and it's the first time I run into the function dcast(). From your experience with this function, can dcast() also be used to create user specified new column names (i.e. for my values of class could I create respective column names but with "_DT" appended to the end like so: "a_DT,b_DT,c_DT"). I have a follow up question that'll add to my initial question and I'm hoping dcast can also be used to solve it. Thank you – theneil Aug 29 '19 at 16:26
  • You can use the same approach for this, I have just updated the answer – MKa Aug 30 '19 at 01:02
0

Here's one option.

unique_wide <- dcast(DT[, unique(CLASS), by = ID], ID ~ V1, value.var = "V1")

classes <- setdiff(names(unique_wide), "ID")
unique_wide[, (classes) := lapply(.SD, function(col) { ifelse(is.na(col), 0L, 1L) }),
            .SDcols = classes]

DT[, (classes) := unique_wide[.SD, classes, on = "ID", with = FALSE]]
DT[]
   ID CLASS a b c
1:  1     a 1 1 0
2:  1     a 1 1 0
3:  1     b 1 1 0
4:  2     c 0 1 1
5:  2     b 0 1 1

We first get the possible unique values with DT[, unique(CLASS), by = ID]. Note that you can directly reference columns in j without needing $. We can reshape that to wide format to get something like this:

   ID    a b    c
1:  1    a b <NA>
2:  2 <NA> b    c

The next 2 lines simply transform values into integers, setting NA to 0 and 1 otherwise.

> unique_wide
   ID a b c
1:  1 1 1 0
2:  2 0 1 1

Afterwards this idiom is applied, which is like a nested join. In this case the join is done based on ID, so all the rows where it matches between DT and unique_wide have the values of non-ID columns updated with what is present in unique_wide. Also note that with = FALSE is useful to select columns based on a variable holding a character vector.

BTW, note that you don't even need the initial 0 non-ID columns, if you declare your table as

DT <- data.table::data.table(ID=c("1","1","1","2","2"),
                             CLASS=c("a","a","b","c","b"))

the code above will still work.

Alexis
  • 4,950
  • 1
  • 18
  • 37
0

Another possible approach:

idx <- DT3[, CJ(I=.I, J=match(unique(CLASS), names(DT))), by=ID]
setDF(DT3)
DT3[as.matrix(idx[, .(I, J)])] <- 1L
setDT(DT3)[]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35