I'm adding this as an answer to get some more space than in the comments.
Working fast with 'big data'
200 GB of text files is reasonably big data, which requires either significant effort to speed up the processing or a significant wait time. There's no easy way around it ;)
- you need to get your data to memory to start any work
  - it is fastest to read your files one by one (NOT in parallel) when reading from a single hard drive
  - measure how much time it takes to load the data without parsing
  - your load time for multiple similar files will be just a multiple of the single-file time; you can't get any magic improvements here
  - to improve the load time you can compress the input files - it pays off only if you'll be using the same data source multiple times (after compression, fewer bytes must cross the hard drive -> memory boundary, which is slow)
  - when choosing how to compress the data, aim for load(compressed) + decompress time to be smaller than load(uncompressed)
- you need to parse the raw data
  - measure how much time it takes to parse the data
  - if you cannot separate the parsing, measure how much time it takes to load and parse the data; the parse time is then the difference from the previously measured load time
  - parsing can be parallelized, but it makes sense only if parsing is a substantial part of the overall time
- you need to do your thing
  - this can usually be done in parallel (see the sketch after this list)
- you need to save the results
  - unless the results are as huge as the input, you don't care
  - if they are huge, you need to serialize your IO again, that is, save the files one by one, not in parallel
  - again compression helps, if you choose an algorithm and settings where compression time + write time is smaller than the write time of the uncompressed data
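To make the shape of this concrete, here is a minimal bash sketch of such a pipeline; parse.awk and the block size are placeholders, not a recommendation. One process reads the file serially, GNU parallel --pipe hands the chunks to several mawk workers (more on mawk below), and the output is written once, compressed.

# serial read -> parallel parsing of 64 MB chunks -> single compressed output
# parse.awk is a placeholder for whatever your parsing step actually does
cat mydata.txt | parallel --pipe --block 64M "mawk -f parse.awk" | gzip -3 > results.txt.gz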
To get raw load times, bash is your friend. Using pipe viewer or the builtin time you can easily check the time it takes to read through a file by doing
pv mydata.txt > /dev/null
# alternatively
time cat mydata.txt > /dev/null
Be aware that the disk cache will kick in when you repeatedly measure the same file.
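If you want cold-cache numbers on Linux, you can drop the page cache between runs (requires root; this is only for benchmarking, not part of the actual workflow):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches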
As for the compression, if you're stuck with R, gzip is the only reasonable option. If you do some pre-processing in bash, lz4 is the tool of choice, because it's really fast at decent compression ratios.
gzip -3 mydata.txt
pv mydata.txt.gz | zcat > /dev/null
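If you want to compare against lz4 (assuming it is installed), the round trip looks like this; note that lz4 keeps the original file and writes mydata.txt.lz4 next to it:

lz4 mydata.txt
pv mydata.txt.lz4 | lz4 -dc > /dev/null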
Here we're getting to the pre-processing. It pays off to use UNIX tools, which tend to be really fast, to pre-process the data before loading it into R. You can filter columns with cut and filter rows with mawk (which is often much faster than gawk).
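For example, a made-up filter (the tab delimiter, column numbers and threshold are placeholders, adjust them to your data) that keeps columns 1 and 3 and drops rows whose third column is 100 or less could look like:

# keep columns 1 and 3, then keep only rows whose (now second) field exceeds 100
cut -f1,3 mydata.txt | mawk -F'\t' '$2 > 100' > filtered.txt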