I am learning about AWS these days, so I apologize if this question is too basic. I've read a number of open and closed questions on the benefits of Parquet over CSV (answered: What are the pros and cons of parquet format compared to other formats?), and on RecordIO-protobuf in terms of File vs. Pipe mode (e.g. the unanswered What makes RecordIO attractive). However, I haven't seen any comparison between RecordIO-protobuf and Parquet.

Here's what I could gather from my research:

Other than the two differences above, what are the pros and cons of using Parquet vs. RecordIO format? Moreover, searching for "Parquet vs. RecordIO" gave me zero Google results, which makes me think that I am comparing apples with oranges.

I'd appreciate any thoughts.

watchtower
    Parquet is most commonly used for data analytics. It's a very efficient way to store data for use in Athena, Glue, and EMR, if we just consider AWS. RecordIO is more for binary streaming data, e.g. images. You can't use RecordIO for data analytics at AWS. – Marcin Feb 28 '21 at 00:06

1 Answer

Parquet is great for analytics data because its columnar layout keeps files small and lets you scan only the columns of interest.
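To make the "scan only the columns of interest" point concrete, here is a toy sketch of a columnar file. It is not the Parquet format itself: the JSON encoding and the helper names `write_columnar` and `read_columns` are made up for illustration. It only mimics the general idea of storing each column as its own chunk and keeping an index in a footer, so a reader can jump straight to the columns it wants.

```python
import io
import json
import struct


def write_columnar(rows, buf):
    """Write rows as one chunk per column, followed by a footer index."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    index = {}
    for name, values in columns.items():
        chunk = json.dumps(values).encode("utf-8")
        index[name] = (buf.tell(), len(chunk))  # (offset, length) of this column
        buf.write(chunk)
    footer = json.dumps(index).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length is the last 4 bytes


def read_columns(buf, wanted):
    """Read only the requested columns; also return how many chunk bytes were read."""
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_len, io.SEEK_END)
    index = json.loads(buf.read(footer_len))
    data, bytes_read = {}, 0
    for name in wanted:
        offset, length = index[name]
        buf.seek(offset)  # skip directly to the column chunk
        data[name] = json.loads(buf.read(length))
        bytes_read += length
    return data, bytes_read


# Hypothetical sample data, just to exercise the functions.
rows = [
    {"user_id": 1, "country": "DE", "amount": 9.5},
    {"user_id": 2, "country": "US", "amount": 3.0},
]
buf = io.BytesIO()
write_columnar(rows, buf)
data, bytes_read = read_columns(buf, ["amount"])
print(data, bytes_read, "of", len(buf.getvalue()), "bytes")
```

A row-oriented format such as CSV would have to read every row in full to get the `amount` values; here only the `amount` chunk and the footer are touched.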

The RecordIO format is typically used for training machine learning models: records are read sequentially, so the data the model needs is presented only when it is needed.
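As a rough illustration of that "only when needed" behaviour, here is a minimal sketch of a record-oriented stream, assuming a simple 4-byte length prefix per record. This is not the actual RecordIO wire format (the real format has its own header layout, and in RecordIO-protobuf each payload is a serialized protobuf message); `write_records` and `iter_records` are hypothetical helpers. The point is only that a consumer can pull one record at a time from a stream without loading the whole dataset.

```python
import io
import struct


def write_records(buf, payloads):
    """Append each payload as: 4-byte little-endian length, then the raw bytes."""
    for payload in payloads:
        buf.write(struct.pack("<I", len(payload)))
        buf.write(payload)


def iter_records(stream):
    """Lazily yield one payload at a time from a sequential stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield payload


# Hypothetical payloads standing in for serialized training examples.
buf = io.BytesIO()
write_records(buf, [b"example-1", b"example-2", b"example-3"])
buf.seek(0)
for record in iter_records(buf):
    print(len(record), record)
```

Because the reader only ever moves forward, the same loop works on a pipe or socket as well as on a file, which is why this kind of layout suits streaming input to a training job, whereas a columnar file needs random access to its footer and column chunks.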

David E