What is the preferred way of storing protobuf-encoded data in HDFS? Currently I see two possible solutions:
a) Sequence files: store the serialized/encoded binary data, i.e., the "byte[]", as the value of a sequence file record (see the first sketch below the list).

b) Parquet: Parquet provides protobuf/Parquet converters. My assumption is that when using those converters, the binary data must first be deserialized into an object representation, and that object must then be passed to the protobuf/Parquet converter to be stored in Parquet (the second sketch below the list illustrates this path). I assume doing so will result in higher performance costs compared to solution a). Since I have to process a high volume of small protobuf-encoded data chunks (streamed vehicle data provided via Kafka), performance and memory costs are important aspects.

c) Are there further alternatives?
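
For illustration, here is a minimal sketch of option a), assuming the chunks arrive as already protobuf-encoded byte arrays; the class name, key format and compression setting are placeholders, not part of an existing setup:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ProtoSequenceFileWriter {

        /** Appends already protobuf-encoded payloads to a SequenceFile without decoding them. */
        public static void writeChunks(Configuration conf, Path out,
                                       Iterable<byte[]> encodedChunks) throws IOException {
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (byte[] chunk : encodedChunks) {
                    // The key can point to metadata (producer, schema version, ...);
                    // the value is the raw protobuf encoding, left untouched.
                    writer.append(new Text("vehicle-sensor|schema-v1"), new BytesWritable(chunk));
                }
            }
        }
    }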
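
And a corresponding sketch of option b), assuming the parquet-protobuf module's ProtoParquetWriter; the generic helper is illustrative, but it shows that each chunk must already have been parsed into a message object before it can be written:

    import java.io.IOException;

    import com.google.protobuf.Message;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.proto.ProtoParquetWriter;

    public class ProtoParquetExample {

        /** Writes already deserialized protobuf messages into a Parquet file. */
        public static <T extends Message> void writeRecords(Path out, Class<T> protoClass,
                                                            Iterable<T> records) throws IOException {
            try (ProtoParquetWriter<T> writer = new ProtoParquetWriter<>(out, protoClass)) {
                for (T record : records) {
                    // Unlike option a), the raw bytes must have been parsed into 'record'
                    // beforehand; the writer then re-encodes it into Parquet's columnar layout.
                    writer.write(record);
                }
            }
        }
    }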

To sum up: I'm looking for a solution to store many small protobuf-encoded data chunks (i.e. vehicle sensor data) in HDFS while leaving the raw data as untouched as possible. However, it must be ensured that the data can still be processed afterwards using MapReduce or Spark (a short Spark sketch follows below).
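
To make that processing requirement concrete, here is a minimal Spark sketch under the same assumptions; the path glob and the VehicleData message class mentioned in a comment are hypothetical:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadProtoSequenceFiles {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("read-proto-seqfiles");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Each SequenceFile value holds one protobuf-encoded chunk, exactly as written.
                JavaRDD<byte[]> payloads = sc
                        .sequenceFile("hdfs:///data/vehicle/2016/*/*/*", Text.class, BytesWritable.class)
                        .map(pair -> pair._2().copyBytes()); // copy, because Hadoop reuses Writable instances

                // Deserialization happens only here, at processing time, e.g.:
                // JavaRDD<VehicleData> records = payloads.map(VehicleData::parseFrom);
                System.out.println("stored chunks: " + payloads.count());
            }
        }
    }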

Best, Thomas

Thomas Beer
  • It has been almost a year since you asked this question. I'm starting a project where I'm looking to store protobuf data in HDFS. Do you mind sharing what your approach for this ended up being? – Luis Medina May 14 '16 at 09:54
  • For that project we used solution a), i.e. SequenceFiles to store the protobuf-encoded data. – Thomas Beer May 20 '16 at 07:34
  • @ThomasBeer - could you share your solution? I have found myself in a similar situation! Thanks – AngryPanda Jul 15 '16 at 22:39
  • As said above, we applied solution a), i.e., storing the protobuf-encoded data (bytes) in SequenceFiles. The SequenceFiles are organized by creation time (e.g. 2016/03/30/01 ...). We use a sequence file's key to refer to stored metadata (e.g. who produced the data, which "schema" was used, etc.). – Thomas Beer Jul 18 '16 at 09:28

0 Answers