Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink.
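For reference, the workflow described above looks roughly like the sketch below. It rests on several assumptions: the data is a small made-up frame written to a temporary file (the real CSV is not shown here), `csv_path`, `demo` and `col_means` are hypothetical names, and `mclapply` is taken from the `parallel` package, which is swapped in for multicore because it ships with later R releases.

```r
library(parallel)  # assumption: stands in for the multicore package named above

# Hypothetical stand-in data; the real CSV is far larger.
csv_path <- tempfile(fileext = ".csv")
set.seed(1)
demo <- data.frame(id = 1:1000, x = rnorm(1000), y = runif(1000))
write.csv(demo, csv_path, row.names = FALSE)

# Step 1: the read. On a large file this is where the time goes;
# system.time() reports how long the call takes.
read_time <- system.time(df <- read.csv(csv_path))
print(read_time)

# Step 2: once the data frame is in memory, per-column work is easy to spread
# across cores. mclapply() relies on forking, which Windows lacks, so fall
# back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
col_means <- mclapply(df, mean, mc.cores = n_cores)
print(unlist(col_means))
```

On a file this small the read is instant; the point is only to show where the timing sits relative to the parallel step.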
I realise it would be better to use MySQL etc.
Assume the use of an AWS Cluster Compute Eight Extra Large (8xl) instance running R 2.13.
Specs as follows:
88 EC2 Compute Units (Eight-core 2 x Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Any thoughts or ideas would be much appreciated.