
I am looping over CSV files and appending the data to a "main" data frame.

I am on Windows and using 32-bit R.

for (i in 1:length(files)) {
  print(files[i])
  f <- read.csv(files[i], header = TRUE, stringsAsFactors = FALSE)
  if (i == 1) {
    main = f
  } else {
    main = rbind(main, f)
  }
  print(dim(main))
  print(memory.size(max = FALSE))
}

I am getting this error:

Error: cannot allocate vector of size 64.0 Mb

The last printout of dim(main) and the memory size is:

[1] 4335123      49
[1] 2139.9

so there are 4.3 million rows in main, and I think the size means about 2,140 MB are being used by R.

Any idea how I can get around this error? Main needs to hold about 7 million rows.

Thank you.

user3022875

2 Answers


That would be a big load of data for an R session (and it might not be possible on a 32-bit OS). R needs contiguous memory space for any new object. Shut down R, exit all your other programs, and minimize the number of programs that load when you reboot Windows. Then start only R and retry with a fresh session.
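
To see where you stand after restarting, both of these are Windows-only functions that report megabytes:

memory.size(max = TRUE)  # maximum memory obtained from the OS so far (Mb)
memory.limit()           # current memory cap for this R session (Mb)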

If that fails, you will need to think about limiting the number of lines read when the files are loaded. Look at `?read.csv` for the parameters that cap the number of lines read (`nrows`) and skip lines (`skip`); a sketch follows.
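
A minimal sketch of capping the read (the 100,000 is an arbitrary number for illustration, not a recommendation):

f <- read.csv(files[i], header = TRUE, stringsAsFactors = FALSE,
              nrows = 100000)  # read at most the first 100,000 data rows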

7 million rows with 49 columns is going to create an object that is at least 5*7000000*49 bytes (roughly 1.6 GiB), and that's only if each column were composed of single-character values. If they were numeric columns, then the space requirement would be roughly doubled. The usual configuration of 32-bit Windows allows R only about 2.5 GB, which would theoretically hold that minimally sized data, but even then you would probably not be able to do anything useful with it.
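
The arithmetic, for reference (run in R):

5 * 7000000 * 49 / 1024^3      # ~1.6 GiB at 5 bytes per cell
2 * 5 * 7000000 * 49 / 1024^3  # ~3.2 GiB if that requirement doubles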

Probably the cheapest step would be to rent some cloud space with an instance of R and memory adequate for the task, say 8 to 16 GB.

IRTFM
  • You could also include `gc()` ("garbage collection") at the end of each iteration of your loop, which will cause R to free up any memory that it had allocated but no longer needs. – eipi10 May 12 '15 at 17:03
  • It probably would not hurt. Garbage collection is supposed to occur automatically. – IRTFM May 12 '15 at 17:16
  • It does, but when you need to be sure it happens right away, it's better to call it explicitly. For example, the help file says "...it can be useful to call gc after a large object has been removed, as this may prompt R to return memory to the operating system." – eipi10 May 12 '15 at 17:27
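
For reference, a minimal sketch of what the commenters describe, grafted onto the question's loop (rm(f) and the explicit gc() are the only additions):

for (i in 1:length(files)) {
  f <- read.csv(files[i], header = TRUE, stringsAsFactors = FALSE)
  main = if (i == 1) f else rbind(main, f)
  rm(f)  # drop the per-file copy before the next iteration
  gc()   # prompt R to return the freed memory to the OS right away
}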

You've adopted a 'copy and append' approach with main = rbind(main, f), which makes on the order of n * (n - 1) / 2 copies of the data and uses memory very inefficiently. Instead, try 'pre-allocate and fill':

result = vector("list", length(files))  # pre-allocate one slot per file
for (i in seq_along(files)) {
    ## ... read files[i] into f, as in the question ...
    result[[i]] = f                     # fill the i-th slot
}

followed by a final rbind():

result = do.call("rbind", result)

This will make only two copies of the data, though you might still be pressed for memory.
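
Putting the pieces together with the read.csv() call from the question (a sketch; it assumes all of the files share the same 49 columns, which rbind() requires):

result = vector("list", length(files))
for (i in seq_along(files)) {
    result[[i]] = read.csv(files[i], header = TRUE,
                           stringsAsFactors = FALSE)
}
main = do.call("rbind", result)
rm(result); gc()  # release the list of pieces once they are combined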

Martin Morgan