
Suppose we have a huge data.frame data (60000 x 8000) that has been converted to a data.table by setDT(data). This creates a reference to data rather than making a copy of it, which is great. Then I want to obtain a subset of data, for example the first 40000 rows:

id <- rep(FALSE, nrow(data))  # logical mask over all rows
id[1:40000] <- TRUE           # keep only the first 40000 rows
data <- subset(data, id)
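
As a sanity check that setDT() really converts in place, here is a minimal sketch on a small mock table (the name df and the sizes are illustrative); address() is exported by data.table:

library(data.table)

df <- as.data.frame(matrix(rnorm(1e4), nrow = 1000))
before <- address(df)             # memory address of the data.frame
setDT(df)                         # convert to data.table in place
identical(before, address(df))    # TRUE: no copy was made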

However, the subset() call above doesn't perform well: it makes a copy of data, and I need to call gc() manually to release the memory. In my example, an additional 1.6 GB was consumed by subset, all of which could be released by gc(). I have read some documents about the use of data.table, but maybe I have missed something important. The examples I have found focus on extracting a subset of a data.table and assigning it to a new variable, rather than updating the original one. For example:

new.data <- subset(data, id)
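
A rough way to see the extra allocation in either form, assuming the data and id from above (gc(reset = TRUE) resets R's "max used" counters):

gc(reset = TRUE)              # reset the "max used" memory counters
new.data <- subset(data, id)  # allocates a full copy of the subset
gc()                          # "max used" grows by roughly the subset's size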

Thanks a lot.

  • `data[1:40000]` is enough to get the first 40000 rows. – nicola Jun 10 '15 at 06:47
  • @nicola Yes, but `data <- data[1:40000, ]` still makes a copy. The original `data` takes about 2 GB, and 3.6 GB overall are used after `data <- data[1:40000, ]` is run. Is there a more memory-efficient way to replace `data` with its subset? – Han Zhang Jun 10 '15 at 14:28
  • 1
    If I understand you, you want to remove rows by reference. See here: http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table and in particular @vc273 answer that offers a workaround for this problem. – nicola Jun 10 '15 at 14:45
  • @nicola vc273 gave a great solution to this! Thank you. – Han Zhang Jun 10 '15 at 17:06
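
For completeness, here is a minimal sketch of the column-by-column workaround the comments point to (adapted from vc273's answer in the linked question; the helper name subset_by_column is hypothetical). Instead of materializing the whole subset at once, it copies one column of kept rows at a time and deletes that column from the original by reference, so the original shrinks as the subset grows:

library(data.table)

subset_by_column <- function(DT, keep) {
  cols <- copy(names(DT))        # snapshot: names(DT) shrinks as columns are removed
  out <- setnames(data.table(DT[[cols[1]]][keep]), cols[1])
  DT[, (cols[1]) := NULL]        # drop the column from the original by reference
  for (col in cols[-1]) {
    out[, (col) := DT[[col]][keep]]   # copy only the kept rows of this column
    DT[, (col) := NULL]               # then free it in the original
  }
  out
}

data <- subset_by_column(data, 1:40000)

Each deleted column becomes reclaimable as the loop proceeds, so memory never holds both the full original and the full subset at once; the peak overhead stays near a single column.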

0 Answers