Expanding data.table by operating on a column

Question

I want to perform an operation on a subset of rows in a data.table that result in a greater number of rows than what I started out with. Is there an easy way to expand the original data.table to accommodate this? If not, how could I accomplish this?

Here's a sample of my original data.

DT <- data.table(my.id=c(1,2,3), unmodified=c("a","b","c"), vals=c("apple",NA,"cat"))
DT
   my.id unmodified  vals
1:     1          a apple
2:     2          b    NA
3:     3          c   cat

And this is my desired output.

DT
   my.id unmodified  vals
1:     1          a apple
2:     2          b   boy
3:     2          b   bat
4:     2          b   bag
5:     3          c   cat

The new rows can appear at the end as well, I don't care about the order. I tried DT[my.id == 2, vals := c("boy","bat","bag")], but it ignores the last 2 entries with a warning.

TIA!

EDIT: My original dataset has about 10 million rows, although the entry with a missing value occurs just once. I'd prefer not to create copies of the data.table, if possible.

score 1 · Answer 1 · answered Jun 18 '17 at 22:54

1

You can use the summarize pattern of data.table by setting the group variables to be my.id and unmodified here; this broadcasts values within each group if the length doesn't match:

DT[, .(vals = if(my.id == 2) c("boy","bat","bag") else vals), .(my.id, unmodified)]

#   my.id unmodified  vals
#1:     1          a apple
#2:     2          b   boy
#3:     2          b   bat
#4:     2          b   bag
#5:     3          c   cat

answered Jun 18 '17 at 22:54

Psidom

209,562
33
339
356

Thanks! This works perfectly on my sample. However, this creates a copy of the original data.table, right? My original data has about 10 million rows and I need to repeat this operation a few times. Is there a modification by reference solution? I added that as an edit to the question just now. – Naumz Jun 18 '17 at 23:04
2

I don't think this can be done by reference, maybe you could try create a small data.table with your desired ids and then bind it with your original one using `rbindlist`. This should be faster than the above group by method. – Psidom Jun 18 '17 at 23:09
1

Yeah, the set of rows cannot be altered by reference: https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table – Frank Jun 19 '17 at 02:27

score 0 · Answer 2 · answered Jun 19 '17 at 02:53

Another option is to subset the datasets that have 'my.id' as 2 and not 2, then rbind

rbind(DT[my.id == 2][, .(my.id, unmodified, vals = c('boy', 'bat', 
                 'bag'))], DT[my.id != 2])[order(my.id)]
#    my.id unmodified  vals
#1:     1          a apple
#2:     2          b   boy
#3:     2          b   bat
#4:     2          b   bag
#5:     3          c   cat

score 0 · Answer 3 · answered Jun 19 '17 at 03:46

> DT <- data.table(my.id=c(1,2,3), unmodified=c("a","b","c"), vals=c("apple",NA,"cat"))
> DT
   my.id unmodified  vals
1:     1          a apple
2:     2          b    NA
3:     3          c   cat
> DT2 <- data.table(my.id=rep(2,3), unmodified=rep("b",3), vals=c("boy","bat","bag"))
> DT2
   my.id unmodified vals
1:     2          b  boy
2:     2          b  bat
3:     2          b  bag



> rbind(DT,DT2)
   my.id unmodified  vals
1:     1          a apple
2:     2          b    NA
3:     3          c   cat
4:     2          b   boy
5:     2          b   bat
6:     2          b   bag
> rbind(DT,DT2)[order(my.id),]
   my.id unmodified  vals
1:     1          a apple
2:     2          b    NA
3:     2          b   boy
4:     2          b   bat
5:     2          b   bag
6:     3          c   cat
> na.omit(rbind(DT,DT2)[order(my.id),])
   my.id unmodified  vals
1:     1          a apple
2:     2          b   boy
3:     2          b   bat
4:     2          b   bag
5:     3          c   cat

Expanding data.table by operating on a column

3 Answers3