I have been reading about RecordIO in various places and looking at different implementations on GitHub. I'm simply trying to wrap my head around the pros of such a file format.
The pros I see are the following:
- Block compression. If you only need to read a few records, it will be faster because there is less to decompress.
- Because of the somewhat indexed structure, you could look up a specific record in acceptable time (assuming keys are sorted). This can be useful for quickly locating a record in an ad hoc fashion.
- I can also imagine that such a file format allows finer sharding strategies: instead of sharding per file, you can shard per block.
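To make the points above concrete, here is a minimal sketch of how I picture block compression with an index. This is not the actual RecordIO layout of any implementation; the function names (`write_blocks`, `read_block`, `read_record`), the 4-byte length prefix, the use of zlib, and the in-memory `(offset, length)` index are all my own assumptions for illustration.

```python
import io
import struct
import zlib


def write_blocks(records, block_size=2):
    """Pack length-prefixed records into independently compressed blocks.

    Hypothetical layout: returns (data, index), where index holds one
    (offset, compressed_length) pair per block.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        # Each record is stored as a 4-byte little-endian length plus payload.
        raw = b"".join(struct.pack("<I", len(r)) + r for r in chunk)
        compressed = zlib.compress(raw)
        index.append((out.tell(), len(compressed)))
        out.write(compressed)
    return out.getvalue(), index


def read_block(data, index, block_no):
    """Decompress a single block and return its records."""
    offset, length = index[block_no]
    raw = zlib.decompress(data[offset:offset + length])
    records, pos = [], 0
    while pos < len(raw):
        (size,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        records.append(raw[pos:pos + size])
        pos += size
    return records


def read_record(data, index, record_no, block_size=2):
    """Fetch one record by position, touching only the block that holds it."""
    block_no, within = divmod(record_no, block_size)
    return read_block(data, index, block_no)[within]


records = [b"rec-%d" % i for i in range(5)]
data, index = write_blocks(records, block_size=2)
print(read_record(data, index, 3, block_size=2))
```

In this sketch, reading one record decompresses only its block, and each `(offset, length)` entry in the index could in principle be handed to a different worker, which is what I mean by sharding per block.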
But I fail to see how such a file format is faster to read than plain protobuf with compression.
Essentially I fail to see a big pro in this format.