
Suppose we have a huge data.frame data (60000 x 8000) that has been converted to a data.table by setDT(data). This creates a reference to data rather than making a copy of it, which is great. Then I want to obtain a subset of data, for example the first 40000 rows:

id <- rep(FALSE, nrow(data))  # logical mask over all rows
id[1:40000] <- TRUE           # keep only the first 40000 rows
data <- subset(data, id)
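
As a sanity check that setDT() really converts in place, here is a minimal sketch on a small mock table (the name df and the sizes are illustrative); address() is exported by data.table:

library(data.table)

df <- as.data.frame(matrix(rnorm(1e4), nrow = 1000))
before <- address(df)             # memory address of the data.frame
setDT(df)                         # convert to data.table in place
identical(before, address(df))    # TRUE: no copy was made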

However, the subset() call above doesn't perform well: it makes a copy of data, and I need to call gc() manually to release the memory. In my example, an additional 1.6 GB was consumed by subset, all of which could be released by gc(). I have read some documents about the use of data.table, but maybe I have missed something important. The examples I have found focus on extracting a subset of a data.table and assigning it to a new variable, rather than updating the original one. For example:

new.data <- subset(data, id)
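
A rough way to see the extra allocation in either form, assuming the data and id from above (gc(reset = TRUE) resets R's "max used" counters):

gc(reset = TRUE)              # reset the "max used" memory counters
new.data <- subset(data, id)  # allocates a full copy of the subset
gc()                          # "max used" grows by roughly the subset's size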

Thanks a lot.

  • `data[1:40000]` is enough to get the first 40000 rows. – nicola Jun 10 '15 at 06:47
  • @nicola Yes, but `data <- data[1:40000, ]` still makes a copy. The original `data` takes about 2 GB, and 3.6 GB overall are used after `data <- data[1:40000, ]` is run. Is there a more memory-efficient way to replace `data` with its subset? – Han Zhang Jun 10 '15 at 14:28
  • 1
    If I understand you, you want to remove rows by reference. See here: http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table and in particular @vc273 answer that offers a workaround for this problem. – nicola Jun 10 '15 at 14:45
  • @nicola vc273 gave a great solution to this! Thank you. – Han Zhang Jun 10 '15 at 17:06
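
For completeness, here is a minimal sketch of the column-by-column workaround the comments point to (adapted from vc273's answer in the linked question; the helper name subset_by_column is hypothetical). Instead of materializing the whole subset at once, it copies one column of kept rows at a time and deletes that column from the original by reference, so the original shrinks as the subset grows:

library(data.table)

subset_by_column <- function(DT, keep) {
  cols <- copy(names(DT))        # snapshot: names(DT) shrinks as columns are removed
  out <- setnames(data.table(DT[[cols[1]]][keep]), cols[1])
  DT[, (cols[1]) := NULL]        # drop the column from the original by reference
  for (col in cols[-1]) {
    out[, (col) := DT[[col]][keep]]   # copy only the kept rows of this column
    DT[, (col) := NULL]               # then free it in the original
  }
  out
}

data <- subset_by_column(data, 1:40000)

Each deleted column becomes reclaimable as the loop proceeds, so memory never holds both the full original and the full subset at once; the peak overhead stays near a single column.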

0 Answers