I'm currently learning AWS, so apologies if this question is too basic. I've read several open and closed questions on the benefits of Parquet over CSV (answered: What are the pros and cons of parquet format compared to other formats?) and on RecordIO-protobuf in terms of File vs. Pipe mode (e.g. the unanswered What makes RecordIO attractive). However, I haven't seen any direct comparison between RecordIO-protobuf and Parquet.
Here's what I could gather from my research:
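- Parquet is a columnar storage format (values of the same column are stored together), whereas RecordIO-protobuf is a record-oriented serialization format (each example is written as one serialized record, one after another).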
- Not all SageMaker algorithms support Parquet. Most SageMaker algorithms work best in RecordIO-protobuf format. (https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html)
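
To make the first point concrete, here is a toy sketch of how I currently understand the layout difference. It uses only Python's `struct` module and implements neither real format: actual RecordIO adds its own framing around protobuf-encoded records, and actual Parquet adds row groups, encodings, compression, and footer metadata. The dataset and all function names are made up for illustration.

```python
import struct

# Hypothetical toy dataset: three rows of (label, feature_a, feature_b).
rows = [(1.0, 0.5, 2.0), (0.0, 1.5, 4.0), (1.0, 2.5, 6.0)]


def write_row_records(rows):
    """Row-oriented: each record is a 4-byte length prefix plus its payload."""
    out = bytearray()
    for row in rows:
        payload = struct.pack(f"<{len(row)}d", *row)
        out += struct.pack("<I", len(payload)) + payload
    return bytes(out)


def read_row_records(buf):
    """Yield whole records one at a time, front to back (stream friendly)."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        yield struct.unpack_from(f"<{length // 8}d", buf, offset)
        offset += length


def write_columns(rows):
    """Column-oriented: all values of one column are stored contiguously."""
    return [struct.pack(f"<{len(col)}d", *col) for col in zip(*rows)]


def read_column(columns, index):
    """Read a single column without decoding the others."""
    chunk = columns[index]
    return struct.unpack(f"<{len(chunk) // 8}d", chunk)


row_buf = write_row_records(rows)
col_chunks = write_columns(rows)

first_record = next(read_row_records(row_buf))  # one complete training example
labels = read_column(col_chunks, 0)             # only the label column
```

If this picture is right, the record layout suits a trainer that consumes complete examples sequentially, while the columnar layout suits queries that touch only a few columns. I'd welcome corrections if the analogy is misleading.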
Other than the two differences above, what are the pros and cons of using Parquet vs. RecordIO-protobuf? Moreover, searching for "Parquet vs. RecordIO" returned zero Google results, which makes me think I am comparing apples with oranges.
I'd appreciate any thoughts.