
I have a large list of data.frames that need to be bound pairwise by columns and then by rows before being fed into a predictive model. Since no values will be modified, I would like the final data.frame to reference the original data.frames in my list rather than copy them.

For example:

library(pryr)

#individual data.frames; adding a numeric offset coerces each integer sequence to double (8 MB per column)
df1 <- data.frame(a=1:1e6+0, b=1:1e6+1)
df2 <- data.frame(a=1:1e6+2, b=1:1e6+3)
df3 <- data.frame(a=1:1e6+4, b=1:1e6+5)

#each occupies 16 MB
object_size(df1)  # 16 MB
object_size(df2)  # 16 MB
object_size(df3)  # 16 MB
object_size(df1, df2, df3)  # 48 MB

#will be in a named list
dfs <- list(df1=df1, df2=df2, df3=df3)

#putting them in a list doesn't create a copy
object_size(df1, df2, df3, dfs)  # 48 MB

The final data.frame will have this layout (every unique pair of data.frames bound by columns, then the pairs stacked by rows):

df1, df2
df1, df3
df2, df3

I am currently implementing this as follows:

#generate unique df combinations
df_names <- names(dfs)
pairs <- combn(df_names, 2, simplify=FALSE)
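
For reference, pairs then holds the three name combinations:

str(pairs)
# List of 3
#  $ : chr [1:2] "df1" "df2"
#  $ : chr [1:2] "df1" "df3"
#  $ : chr [1:2] "df2" "df3"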

#bind dfs by columns
combo_dfs <- lapply(pairs, function(x) cbind(dfs[[x[1]]], dfs[[x[2]]]))

#no copies created yet
object_size(dfs, combo_dfs)  # 48 MB
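
To confirm the sharing directly, base R's tracemem() reports a vector's address (a quick check that is not in the original post; it requires an R build with memory profiling, which standard binaries have):

tracemem(dfs$df1$a)            # e.g. "<0x55f3c1d9a010>"
tracemem(combo_dfs[[1]][[1]])  # the same address: cbind() re-used the column vector
untracemem(dfs$df1$a)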

#bind dfs by rows
combo_df <- do.call(rbind, combo_dfs)

#now data gets copied
object_size(combo_df)  # 96 MB
object_size(dfs, combo_df)  # 144 MB
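
The copy is also visible at the vector level: each column of combo_df is a freshly allocated 3e6-element vector (again a tracemem check, not in the original post):

dim(combo_df)            # 3000000 4 (three 1e6-row pairs stacked)
tracemem(combo_df[[1]])  # a new address, distinct from every input column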

How can I avoid copying my data but still achieve the same end result?

alexvpickering
Don't think you can. In the first manipulations, you were just "moving" R objects from one list to another (a column of a data.frame is an R object by itself). The last step involved the creation of new objects (the columns of `combo_df`) which *incidentally* contain the data of two existing objects. A copy is necessary. A vector in R stores its data *contiguously*; you cannot create a vector in which part of the data points to one region and another part to another. – nicola Apr 26 '16 at 16:52
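
To see the contiguity point at the vector level, here is a minimal base-R check (tracemem again; not part of the original comment):

x <- runif(10)
y <- runif(10)
tracemem(x)   # address of x
z <- c(x, y)  # c() must allocate one fresh contiguous block
tracemem(z)   # a different address: z owns its own copy of the data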

1 Answer


Storing the values the way you hope to would require R to deduplicate the repeated blocks inside the data frame, which amounts to compression, and I don't believe data frames support that.

If your motivation for wanting to store the data this way is difficulty fitting it in memory, you could try the ff package. This would allow you to store it in a more compact way on disk. The ffdf class seems to have the properties you need:

By default, creating an 'ffdf' object will NOT create new ff files, instead existing files are referenced. This differs from data.frame, which always creates copies of the input objects, most notably in data.frame(matrix()), where an input matrix is converted to single columns. ffdf, by contrast, will store an input matrix physically as the same matrix and virtually map it to columns.

In addition, the ff package is optimized for fast access.
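
As a minimal sketch of that referencing behaviour, assuming ff is installed (untested here; ff() and ffdf() are the package's constructors):

library(ff)

#write each column to a disk-backed ff vector once
a1 <- ff(df1$a)
b1 <- ff(df1$b)

#per the documentation quoted above, building an ffdf from existing ff
#vectors references the same files instead of copying them
fd <- ffdf(a = a1, b = b1)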

Note that I haven't used this package myself so I can't guarantee it will solve your problem.

Sean