I have the following working in a unit test that writes a single object in Avro/Parquet to a file on my Cloudera/HDFS cluster.
That said, since Parquet is a columnar format, it seems it can only write out an entire file in batch mode (appends and updates are not supported).
So, what are the best practices for writing files for data ingested in real time (via ActiveMQ/Camel; small messages at ~1k msg/sec)?
I suppose I could aggregate my messages (buffer them in memory or other temp storage) and write them out in batch mode using a dynamic filename (a rough sketch of what I mean follows the test code below), but I feel like I'm missing something with doing the partitioning/file naming by hand, etc...
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.avro.ReflectDataSupplier;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://cloudera-test:8020/cm/user/hive/warehouse");
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false);
AvroReadSupport.setAvroDataSupplier(conf, ReflectDataSupplier.class);

Path path = new Path("/cm/user/hive/warehouse/test1.data");

MyObject object = new MyObject("test");
Schema schema = ReflectData.get().getSchema(object.getClass()); // derive the Avro schema via reflection

ParquetWriter<MyObject> parquetWriter = AvroParquetWriter.<MyObject>builder(path)
.withSchema(schema)
.withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
.withDataModel(ReflectData.get())
.withDictionaryEncoding(false)
.withConf(conf)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE) //required because the filename doesn't change for this test
.build();
parquetWriter.write(object);
parquetWriter.close();
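For what it's worth, here's a rough sketch of the aggregate-and-flush approach I have in mind. BatchingWriter is a made-up name, and for brevity it only flushes on a size threshold; a real version would also need a time-based flush so a slow stream doesn't sit in the buffer indefinitely:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical wrapper: buffer incoming messages and flush each batch
// to a new Parquet file with a unique, timestamped name.
public class BatchingWriter {

    private final Configuration conf;
    private final Schema schema;
    private final int batchSize; // flush threshold (size-based only, for brevity)
    private final List<MyObject> buffer = new ArrayList<>();

    public BatchingWriter(Configuration conf, Schema schema, int batchSize) {
        this.conf = conf;
        this.schema = schema;
        this.batchSize = batchSize;
    }

    // Called from the Camel route for each incoming message.
    public synchronized void add(MyObject message) throws Exception {
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Write the whole batch to a fresh file; the timestamp in the name
    // means each batch gets its own file, so no OVERWRITE mode is needed.
    public synchronized void flush() throws Exception {
        if (buffer.isEmpty()) {
            return;
        }
        Path path = new Path("/cm/user/hive/warehouse/batch-"
                + Instant.now().toEpochMilli() + ".parquet");
        ParquetWriter<MyObject> writer = AvroParquetWriter.<MyObject>builder(path)
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
                .withDataModel(ReflectData.get())
                .withConf(conf)
                .build();
        try {
            for (MyObject message : buffer) {
                writer.write(message);
            }
        } finally {
            writer.close();
        }
        buffer.clear();
    }
}

Even with something like this, though, I'd still be hand-rolling the file naming and partitioning, which is the part that feels wrong and why I'm asking whether there's a more standard approach.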