
I have a few huge data.tables dt_1, dt_2, ..., dt_N with the same columns. I want to bind them together into a single data.table. If I use

dt <- rbind(dt_1, dt_2, ..., dt_N)

or

dt <- rbindlist(list(dt_1, dt_2, ..., dt_N))

then the memory usage is approximately double the amount needed for dt_1, dt_2, ..., dt_N. Is there a way to bind them without increasing the memory consumption significantly? Note that I do not need dt_1, dt_2, ..., dt_N once they are combined together.

imsc
  • I'm probably off, but have you considered removing the `dt_1, dt_2` etc. from your environment once you have combined `dt`? – Heroka Jan 13 '16 at 11:25
  • Yes I did remove them afterwards. But during binding the memory is still doubled. – imsc Jan 13 '16 at 12:23
  • See my answer for an approach that is probably a bit slower, but possibly more memory-efficient, with removing-while-binding. – Heroka Jan 13 '16 at 12:23
  • @imsc are you asking about `rbind` by reference? sounds cool, not sure if doable – jangorecki Jan 13 '16 at 12:27
  • @jangorecki. That's what I am after... would avoid copying as well as memory consumption. – imsc Jan 13 '16 at 12:29

4 Answers


Another approach, using a temporary file to 'bind':

library(data.table)

nobs <- 10000
d1 <- d2 <- d3 <- data.table(a = rnorm(nobs), b = rnorm(nobs))
ll <- c('d1', 'd2', 'd3')
tmp <- tempfile()

# Write all tables, writing the header only for the first one
for (i in seq_along(ll)) {
  write.table(get(ll[i]), tmp, append = (i != 1), row.names = FALSE, col.names = (i == 1))
}

# 'Clean up' the original objects from memory (the gc will reclaim the space if needed when loading the file)
rm(list=ll)

# Read the file in the new object
dt<-fread(tmp)

# Remove the file
unlink(tmp)

This is obviously slower than the rbind method, but if you are under memory pressure, it won't be slower than forcing the system to swap out memory pages.

Of course, if your original objects were loaded from files in the first place, prefer concatenating the files before loading them into R, using a tool better suited to working with files (cat, awk, etc.)
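As a minimal sketch of that file-level concatenation (assuming hypothetical files dt_1.csv, dt_2.csv, dt_3.csv that all share the same header line):

```shell
# Keep the header from the first file only, then append the
# data rows (everything after line 1) of every file.
head -n 1 dt_1.csv > combined.csv
for f in dt_1.csv dt_2.csv dt_3.csv; do
  tail -n +2 "$f" >> combined.csv
done
```

The combined file can then be read once with `fread("combined.csv")`, so only the final table ever lives in R's memory.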

Tensibai

You can remove your data.tables after you've bound them; the doubled memory usage is caused by the new data.table consisting of copies.

Illustration:

# create some data
library(data.table)
nobs <- 10000
d1 <- d2 <- d3 <- data.table(a = rnorm(nobs), b = rnorm(nobs))
dt <- rbindlist(list(d1, d2, d3))

Then we can look at memory usage per object:

sort( sapply(ls(),function(x){object.size(get(x))}))
  nobs     d1     d2     d3     dt 
    48 161232 161232 161232 481232 

If memory usage is so large that the separate data.tables and the combined data.table cannot coexist, we can use a for-loop with get (shocking, but IMHO this case warrants it, as there are a small number of data.tables and it's easily readable and understandable) to create our combined data.table and delete the individual ones at the same time:

mydts <- c("d1", "d2", "d3") # vector of data.table names

dt <- data.table() # empty data.table to bind objects to

for (d in mydts) {
  dt <- rbind(dt, get(d))
  rm(list = d)
  gc() # garbage collection
}
Heroka
  • Maybe worth clarifying: `d1 <- d2 <- d3` only occupies the space of the first one in memory. `<- DT` makes a new pointer to a DT, while `<- copy(DT)` will make a new copy (doubling space consumption). Try `address(d1)` and `address(d2)`, which should have the same value. The `tables()` command is also handy for examining memory. Also, not sure why you would `get` here... `L <- list(d1,d2,d3)` can be iterated over and is also just pointers, I think: `sapply(L, address)` – Frank Jan 13 '16 at 15:32
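To sketch Frank's point above (a sketch, not part of the original answers): building a list of the tables copies only pointers, so `rbindlist` plus `rm` keeps the peak usage at roughly one extra copy of the combined result rather than one per bind:

```r
library(data.table)

nobs <- 10000
d1 <- d2 <- d3 <- data.table(a = rnorm(nobs), b = rnorm(nobs))

nms <- c("d1", "d2", "d3")
# mget() returns a list of pointers to the existing tables (no data copied yet);
# rbindlist() then allocates only the combined result.
dt <- rbindlist(mget(nms))
rm(list = nms)
gc()
```

This still needs the originals and the result to coexist briefly, so it does not answer the rbind-by-reference question, but it avoids the repeated copying of the growing `dt` in the loop above.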

I guess <<- and get can help you with this.

UPDATE: <<- is not necessary.

df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- df1
df3 <- df1

dt.lst <- c("df2", "df3")

for (i in dt.lst) {
  df1 <- rbind(df1, get(i))
  rm(list=i)
}

df1
Ven Yao

Thanks for the other great answers. In case your data frames are contained in a large list of data frames, you can use a NULL assignment (explained in this answer) or within (explained in this answer) to remove data frames from the list at each iteration.

# Large list of data frames
l_df <- list(head(iris), iris[c(92:95), ], tail(iris))
df_stack <- data.table::data.table()
# As long as the list is not empty,
# bind the first list item and remove it
while (!identical(l_df, list())) {
    df_stack <- rbind(df_stack, l_df[[1]])
    l_df[1] <- NULL
}

This will use less peak memory than binding the whole list at once in this way:

l_df <- list(head(iris), iris[c(92:95),], tail(iris))
dfdt = data.table::rbindlist(l_df)

And should give an identical data frame:

identical(df_stack, dfdt)
# [1] TRUE
Paul Rougieux