EDIT: This question is not a duplicate; reading the data alone is not the problem.
I want to do analysis on a CSV file in R that is around 10 GB. I am working on a GCE virtual machine with 60 GB of memory.
I would like to know which R package is suitable for reading such large files and performing operations like filter, group-by, column means (`colMeans`), etc. on them.
Which of the following would be the best choice (given that memory is not a constraint)?
- Stick with `read.csv` and packages like `dplyr` or the apply family (a minimal sketch of this option is included after the list).
- Use packages like `ff` or `bigmemory` for parallel processing.
- Use RSpark or any other distributed computing framework.
- Any other methodology that is well suited for this.
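
For context, this is the kind of pipeline I have in mind for the first option: a minimal sketch using `read.csv` plus `dplyr`, assuming a hypothetical file `data.csv` with a grouping column `group` and a numeric column `value` (these names are placeholders, not my actual schema).

```r
library(dplyr)

# With 60 GB of RAM, a ~10 GB file can in principle be read fully into memory.
df <- read.csv("data.csv", stringsAsFactors = FALSE)

# filter -> groupBy -> per-group mean
result <- df %>%
  filter(value > 0) %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# Column means across all numeric columns
col_means <- colMeans(df[sapply(df, is.numeric)])
```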