
I have a data set with 20 million records and 50 columns, and I want to load it into R. My machine has 8 GB of RAM and the data set is 35 GB on disk. I have to run my R code on the complete data. So far I have tried the data.table (`fread`) and bigmemory (`read.big.matrix`) packages to read the data, but without success. Is it possible to load 35 GB of data on my 8 GB machine?
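Roughly, the calls I tried look like this (the file name is a placeholder; both attempts failed on my 8 GB machine):

```r
library(data.table)
library(bigmemory)

# Attempt 1: read the whole CSV into a data.table
dt <- fread("mydata.csv")

# Attempt 2: read into a file-backed big.matrix
# (bigmemory holds only a single atomic type, e.g. "double", per matrix)
bm <- read.big.matrix("mydata.csv", header = TRUE, type = "double",
                      backingfile = "mydata.bin", descriptorfile = "mydata.desc")
```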

If it is possible, please guide me on how to overcome this issue.

Thanks in advance.

kondal
  • Does http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r help? It is about data frames. – rmuc8 Feb 20 '15 at 10:04
  • What format is your data in? – LauriK Feb 20 '15 at 10:08
  • My data is in CSV format. – kondal Feb 20 '15 at 10:12
  • Your **file** may be 35 GB, but your **data** probably isn't. 20,000,000 records times 50 columns = 1E9 values. Assuming they are all numeric, that's 8 bytes per value once they are read in = 8E9 bytes = 8 GB. Or are they text? You didn't say. – Spacedman Feb 20 '15 at 12:19
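As a quick sanity check of that estimate, the same arithmetic in plain R (assuming every value ends up as a numeric):

```r
n_rows <- 20e6                 # 20 million records
n_cols <- 50
bytes  <- n_rows * n_cols * 8  # 8 bytes per numeric value
bytes / 2^30                   # ~7.45 GiB, i.e. roughly all of the 8 GB of RAM
```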

1 Answer


By buying more RAM. Even if you manage to load all your data (seems like it's textual), there won't be any room left in memory to do whatever you want to do with the data afterwards.

It's actually probably the only correct answer if you HAVE TO load everything into RAM at once. You probably don't have to, but even so, buying more RAM might be easier.

Also look at cloud computing options, such as Azure, AWS, or Google Compute Engine.
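If it turns out you do not have to hold everything in RAM at once, one possible workaround (not part of the answer above) is to stream the CSV in chunks and keep only small aggregated results. A rough sketch with data.table's `fread()`, where the file name and chunk size are placeholders:

```r
library(data.table)

csv_file   <- "mydata.csv"                       # placeholder path
total_rows <- 20e6                               # known from the question
chunk_size <- 1e6                                # tune so one chunk fits in RAM
header     <- names(fread(csv_file, nrows = 0))  # read column names only

results <- list()
for (i in seq_len(ceiling(total_rows / chunk_size))) {
  skip_rows <- (i - 1) * chunk_size
  chunk <- fread(csv_file, skip = skip_rows + 1, nrows = chunk_size,
                 header = FALSE, col.names = header)

  # ...compute a small per-chunk summary here instead of keeping the raw rows...
  results[[i]] <- nrow(chunk)   # placeholder summary
}

# Combine the small per-chunk results
sum(unlist(results))
```

Note that each call has to re-scan the file to find its skip point, so for many repeated passes a disk-backed store (e.g. an SQLite database via RSQLite) may scale better.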

LauriK
  • In the case of textual data, yes. The reason is that R is able to store textual data internally in a more compact way. Try the `fread()` function from the data.table package. Afterwards you may want to save the result in .RData format with the `save()` command (a short sketch of this is below these comments). – LauriK Feb 20 '15 at 10:30
  • Actually, my data is alphanumeric. I have to iterate over the data thousands of times to get optimal values. How does R perform in that scenario? – kondal Feb 20 '15 at 10:42
  • That completely depends on what you want to do. If you write your own R code, it is going to be slow on this amount of data. If you use packages that have a C++ backend, it might be doable. It also depends on whether the iterations need additional memory, of course. The only way to find out is to try. – LauriK Feb 20 '15 at 11:53
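For reference, a minimal sketch of the read-once-then-cache workflow suggested in the comments above (the file names are placeholders, and this still assumes the parsed data fits in memory):

```r
library(data.table)

# Parse the CSV once; repeated character values share storage in R's string pool,
# so the in-memory size can be smaller than the 35 GB file suggests
dt <- fread("mydata.csv")

# Cache the parsed object so later sessions skip the slow CSV parse
save(dt, file = "mydata.RData")

# In a later session:
load("mydata.RData")   # restores the object `dt`
```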