Why do I need to assign a data.table to a new object to filter rows?

Question

I'm just starting to learn data.table in R:

library(data.table)
iris <- as.data.table(iris)  # the built-in iris is a data.frame; convert it first
iris[Species == 'setosa']

The code above doesn't remove the non-setosa rows from the dataset; it just prints the rows where the condition is satisfied.

iris <- iris[Species == 'setosa']

The above code works, but I'm wondering in what kinds of situations I need to assign the result to an object for the operation to take effect rather than just print the results. Also, is there any risk in assigning to the same object?

This question has been addressed in detail here: [How to delete a row by reference in data.table?](https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table) To summarize, rows cannot be deleted in place, and the method you are using _(assigning to the same object)_ is the correct and idiomatic way to accomplish this task. _(Unless of course, [this feature request](https://github.com/Rdatatable/data.table/issues/635) is ever implemented)_ — Matt Summersgill, Dec 26 '19 at 19:58

score 5 · Accepted Answer · answered Dec 27 '19 at 03:36

Fundamentally, columns are much easier to modify by-reference in R since columns are list elements, and list elements are not stored contiguously in memory.
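As a rough illustration of the "columns are list elements" point, the sketch below builds a small table (the name `DT` and its contents are made up for this example) and inspects it with base R and data.table's `address()` helper.

```r
library(data.table)

# Hypothetical two-column table, used only for illustration
DT <- data.table(A = c(10L, 20L, 30L, 40L, 50L),
                 B = c(0.1, 0.2, 0.3, 0.4, 0.5))

is.list(DT)   # a data.table is a list with one element per column
length(DT)    # counts columns, not rows

# Each column is its own vector with its own memory address
addr_A <- address(DT$A)
addr_B <- address(DT$B)
addr_A != addr_B
```

The addresses themselves will differ from session to session; the only point is that the two columns live in separate allocations.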

Removing a column by reference just means deallocating its allotted memory and removing the associated pointers.

By contrast, removing some rows is a lot harder and can't really be done by-reference -- some copying is inevitable. Consider this simplified representation of a table with two columns, A and B:

    1  2  3  4  5
A: [ ][ ][ ][ ][ ]
B: [ ][ ][ ][ ][ ]

A is stored in contiguous memory as an array of size 5*sizeof(A's element type). E.g., if A is an integer column, it gets 4 bytes per cell; a numeric column gets 8 bytes per cell.
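One way to see the per-cell sizes is to compare `object.size()` for an integer and a numeric vector of the same length. This is only a sketch: the length `n` is arbitrary, and the reported sizes include a small fixed header, so the ratio is close to 2 rather than exactly 2.

```r
n <- 1e6  # arbitrary length, large enough that the vector header is negligible

size_int <- as.numeric(object.size(integer(n)))  # roughly 4 bytes per cell
size_dbl <- as.numeric(object.size(numeric(n)))  # roughly 8 bytes per cell

size_dbl / size_int  # close to 2
```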

Deleting B is easy from a memory point of view: just tell R/your system you don't need that memory anymore:

    1  2  3  4  5
A: [ ][ ][ ][ ][ ]
B: [x][x][x][x][x]

A's memory allocation is unaffected.
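In data.table terms, this is what `:=` with `NULL` does. The sketch below uses a made-up table `DT` and data.table's `address()` to check that neither the table nor column A has moved after B is dropped; no reassignment is needed.

```r
library(data.table)

# Hypothetical table, used only for illustration
DT <- data.table(A = c(10L, 20L, 30L, 40L, 50L),
                 B = c(0.1, 0.2, 0.3, 0.4, 0.5))

addr_DT_before <- address(DT)
addr_A_before  <- address(DT$A)

DT[, B := NULL]  # drop column B by reference

names(DT)  # only "A" remains

# Compare addresses before and after the deletion
same_table <- identical(address(DT), addr_DT_before)
same_A     <- identical(address(DT$A), addr_A_before)
```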

By contrast, consider removing some rows from the table (i.e., both A and B):

    1  2  3  4  5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]

If we simply release the memory for these 4 cells, our table will be broken: each column's memory is split by a two-cell gap (2*sizeof(A) bytes in the case of A) between its 1st and 4th rows.

The best we can do is to try and minimize copying by shifting rows 4 & 5, and leaving row 1 alone:

    1  2  3<-4<-5
A: [ ][x][x][ ][ ]
B: [ ][x][x][ ][ ]

    1  4  5
A: [ ][ ][ ]
B: [ ][ ][ ]

In the linked answer, Matt alludes to a very specific case in which the by-reference approach can work -- when the rows to add/drop come at the end. Hopefully the illustration makes it clear why this is easier to do.

This technical difficulty is the reason why the linked feature request is so hard to fulfill. Copying many columns' data as illustrated is easier said than done and requires a lot of finesse to get it working and communicated back to R from C properly.
