
I have been reading about RecordIO and looking at several implementations on GitHub. I'm simply trying to wrap my head around the advantages of such a file format.

The pros I see are the following:

  1. Block compression. Reading only a few records is faster because only the blocks that contain them need to be decompressed, not the whole file.
  2. Because of the indexed structure, you can look up a specific record in acceptable time (assuming the keys are sorted). This is useful for quickly locating a record in an ad hoc fashion.
  3. I can also imagine that such a file format allows finer-grained sharding strategies: instead of sharding per file, you can shard per block.
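
To make these three points concrete, here is a minimal sketch of what I understand by a block-structured record file. It does not reproduce any real RecordIO implementation: the layout (a 4-byte length prefix per block and per record), the in-memory index, and the names `write_blocks` and `read_record` are all assumptions made up for illustration.

```python
import io
import struct
import zlib


def write_blocks(records, records_per_block=3):
    """Pack records into independently compressed blocks; return (data, index)."""
    buf = io.BytesIO()
    index = []  # one (first_record_number, byte_offset) entry per block
    for start in range(0, len(records), records_per_block):
        chunk = records[start:start + records_per_block]
        # Length-prefix every record so the block can be split again on read.
        raw = b"".join(struct.pack("<I", len(r)) + r for r in chunk)
        packed = zlib.compress(raw)
        index.append((start, buf.tell()))
        buf.write(struct.pack("<I", len(packed)) + packed)
    return buf.getvalue(), index


def read_record(data, index, n):
    """Return record n, decompressing only the block that holds it."""
    # Pick the last block whose first record number is <= n.
    first, offset = max(entry for entry in index if entry[0] <= n)
    (size,) = struct.unpack_from("<I", data, offset)
    raw = zlib.decompress(data[offset + 4:offset + 4 + size])
    # Skip over the records that precede record n inside this block.
    pos = 0
    for _ in range(n - first):
        (length,) = struct.unpack_from("<I", raw, pos)
        pos += 4 + length
    (length,) = struct.unpack_from("<I", raw, pos)
    return raw[pos + 4:pos + 4 + length]


records = [("rec-%d" % i).encode() for i in range(10)]
data, index = write_blocks(records)
single = read_record(data, index, 7)  # touches only the block starting at record 6
```

Because every block is compressed on its own, each index entry could also be handed to a different worker, which is the per-block sharding from point 3.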

But I fail to see how such a file format is faster to read than plain protobuf with compression.

Essentially, I fail to see a major advantage in this format.

jeremie
