
I have the code below working in a unit test; it writes a single object in Avro/Parquet format to a file on my Cloudera/HDFS cluster.

That said, given that Parquet is a columnar format, it seems it can only write out an entire file in batch mode (updates are not supported).

So, what are the best practices for writing files for data ingested in real time via ActiveMQ/Camel (small messages at roughly 1k msg/sec)?

I suppose I could aggregate my messages (buffering them in memory or in other temporary storage) and write them out in batch mode under a dynamically generated filename (roughly the Camel route sketched after the code below), but I feel like I'm missing something about doing the partitioning/file naming by hand, etc.

// imports assume the org.apache.parquet artifacts; older parquet-mr releases
// used the parquet.* package names instead
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://cloudera-test:8020/cm/user/hive/warehouse");

// disable the old Avro compatibility mode and use reflection-based records;
// ReflectDataSupplier is an AvroDataSupplier that returns ReflectData
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false);
AvroReadSupport.setAvroDataSupplier(conf, ReflectDataSupplier.class);

Path path = new Path("/cm/user/hive/warehouse/test1.data");

MyObject object = new MyObject("test");

// derive the Avro schema from the POJO via reflection
Schema schema = ReflectData.get().getSchema(object.getClass());

ParquetWriter<MyObject> parquetWriter = AvroParquetWriter.<MyObject>builder(path)
    .withSchema(schema)
    .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
    .withDataModel(ReflectData.get())
    .withDictionaryEncoding(false)
    .withConf(conf)
    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)   // required because the filename doesn't change for this test
    .build();

parquetWriter.write(object);
parquetWriter.close();
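
For context, the aggregation step I have in mind would look something like the Camel route below. This is only a sketch: the queue name, the completion thresholds, and the parquetBatchWriter bean are hypothetical, and GroupedBodyAggregationStrategy is the Camel 2.x strategy that collects the message bodies into a single List.

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.GroupedBodyAggregationStrategy;

public class ParquetBatchRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:inbound.messages")
            // correlate everything together and collect the bodies into one List
            .aggregate(constant(true), new GroupedBodyAggregationStrategy())
            .completionSize(10000)       // flush after 10k messages...
            .completionTimeout(60000)    // ...or after 60s of inactivity, whichever comes first
            .to("bean:parquetBatchWriter");  // hypothetical bean that writes the batch to a new Parquet file
    }
}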

1 Answer


Based on my (limited) research, it appears that Parquet files can't be appended to (by design), so I simply have to batch my real-time data (in memory or otherwise) and write out complete Parquet files, roughly as sketched below.

How to append data to an existing parquet file
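
To make "batch and write" concrete, here is a rough sketch of the kind of thing I mean, reusing the same AvroParquetWriter API from the question. The BatchingParquetWriter class, the batch-size threshold, and the timestamp-based file naming are illustrative choices, not an established pattern.

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BatchingParquetWriter<T> {
    private final Class<T> type;
    private final Configuration conf;
    private final String targetDir;
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();

    public BatchingParquetWriter(Class<T> type, Configuration conf, String targetDir, int batchSize) {
        this.type = type;
        this.conf = conf;
        this.targetDir = targetDir;
        this.batchSize = batchSize;
    }

    // buffer a record; when the threshold is reached, write the whole batch as one file
    public synchronized void add(T record) throws Exception {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public synchronized void flush() throws Exception {
        if (buffer.isEmpty()) return;
        // one new file per batch; the timestamp keeps filenames unique
        Path path = new Path(targetDir, "batch-" + Instant.now().toEpochMilli() + ".parquet");
        Schema schema = ReflectData.get().getSchema(type);
        ParquetWriter<T> writer = AvroParquetWriter.<T>builder(path)
            .withSchema(schema)
            .withDataModel(ReflectData.get())
            .withConf(conf)
            .build();
        try {
            for (T record : buffer) {
                writer.write(record);
            }
        } finally {
            writer.close();
        }
        buffer.clear();
    }
}

A time-based flush (so a slow trickle of messages doesn't sit in the buffer indefinitely) and writing to a temp path followed by a rename (so readers like Hive never see half-written files) would be natural additions.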
