I am producing very large datasets (>120 GB), each of which is a named list of 100x100x3 arrays with millions of records. They are then fed to a CNN and classified into one of 4 categories. Processing this amount of data at once is tedious and often exhausts my RAM, so I would like to split my dataset into chunks and process the chunks in parallel.

I found a few packages; bigmemory and disk.frame look the most suitable. But do they accept lists? Or are there better solutions for lists?

Henrik
ramen
  • Please show that you have made an effort to read the documentation for the packages of interest by citing the specific wording that you don't understand. Recommending packages is regarded as opinion based and so off topic for SO. Suggest you review https://cran.r-project.org/web/views/HighPerformanceComputing.html – G. Grothendieck Jan 18 '22 at 14:59
  • Yes, I have made an effort and yes, I checked CRAN back and forth, but I am a noob who does not understand which parameters/trade-offs actually matter, as I am not a professional. I do not understand, for example, how nested lists with many levels would be treated, as there is no example of this kind of input. – ramen Jan 18 '22 at 16:14

2 Answers

I had to get my data into a tabular format first, so I did something like this:

I need it to be named, so I extracted the names into a vector:

  nameslist <- names(list1)

I converted my list to a data.frame (the "chunk" column holds my original data from list1, used here as a dummy; it is a nested list of matrices, 3 matrices per name to be specific):

dummy_dframe <- data.frame(name= nameslist, chunk = I(list1))

I tried to convert it into a disk.frame:

dummy_diskframe <- as.disk.frame(dummy_dframe)

Then I encountered the following error:

Error in `[.data.table`(df, , { :
The data frame contains these list-columns: 'chunk'. List-columns are not yet supported by disk.frame. Remove these columns to create a disk.frame

So there is no way to use this for a nested list of matrices.

After that I changed my approach and decided to use a dummy data.frame with one column containing the name and one column containing a matrix. I created it in two steps, following Jonathan Gellar's example from this thread:

data.frame with a column containing a matrix in R

Under this scenario, disk.frame threw another type of error:

Error in `[.data.table`(df, , { :
Column 2 ['mat'] is length 4 but column 1 is length 2; malformed data.table.
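
A likely reason for this second error, shown as a minimal base R sketch (the toy objects df and mat are made up for illustration, they are not my real data): a matrix stored as a single data.frame column is still one vector of length nrow * ncol, so anything that checks column lengths sees 4 where it expects 2.

```r
# Toy data.frame with 2 rows and a 2x2 matrix stored as ONE column.
df <- data.frame(name = c("a", "b"))
df$mat <- matrix(1:4, nrow = 2)

n_rows   <- nrow(df)         # number of rows in the data.frame
name_len <- length(df$name)  # an ordinary column has one element per row
mat_len  <- length(df$mat)   # the matrix column has nrow * ncol elements

print(c(rows = n_rows, name = name_len, mat = mat_len))
```

The lengths match the numbers in the error message (2 for the first column, 4 for 'mat'), which is consistent with the matrix column being the problem.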

So, unfortunately, this is not a solution I could use with my datasets. I am sharing this so that other people can save their time.
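
For what it is worth, the chunking idea from the question can be sketched without any tabular package. This is only a minimal sketch, assuming base R; the toy list1, the chunk size, the file names and the process_chunk() helper are all hypothetical placeholders, and the real per-chunk work (e.g. CNN prediction) would go where the mean is computed.

```r
# Toy stand-in for the real data: a named list of 100x100x3 arrays.
set.seed(1)
list1 <- setNames(
  lapply(1:10, function(i) array(runif(100 * 100 * 3), dim = c(100, 100, 3))),
  paste0("img_", 1:10)
)

# Split the list into chunks of at most chunk_size elements (names are kept).
chunk_size <- 4
chunk_id <- ceiling(seq_along(list1) / chunk_size)
chunks <- split(list1, chunk_id)

# Write each chunk to its own .rds file, so that later only one chunk
# has to be held in RAM at a time.
out_dir <- file.path(tempdir(), "chunks")
dir.create(out_dir, showWarnings = FALSE)
chunk_files <- file.path(out_dir, sprintf("chunk_%03d.rds", seq_along(chunks)))
for (i in seq_along(chunks)) saveRDS(chunks[[i]], chunk_files[i])

# Hypothetical per-chunk worker: reads one file and returns one number
# per array (here simply the mean, as a placeholder for real processing).
process_chunk <- function(path) {
  chunk <- readRDS(path)
  vapply(chunk, mean, numeric(1))
}

# Sequential here; parallel::mclapply() or parallel::parLapply() could be
# used instead of lapply() to process the chunk files in parallel.
results <- unlist(lapply(chunk_files, process_chunk))
```

In a real pipeline the chunks would be written as the data are produced, rather than by splitting a list that is already fully in memory.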

ramen

{disk.frame} only works with tabular data.

xiaodai