
We are using Flume and S3 to store our events. I noticed that events are only transferred to S3 when the HDFS sink rolls to the next file or when Flume is shut down gracefully.

This can, in my mind, lead to potential data loss. The Flume documentation says:

...Flume uses a transactional approach to guarantee the reliable delivery of the Events...

Here is my configuration:

agent.sinks.defaultSink.type = hdfs
agent.sinks.defaultSink.hdfs.fileType = DataStream
agent.sinks.defaultSink.channel = fileChannel
agent.sinks.defaultSink.serializer = avro_event
agent.sinks.defaultSink.serializer.compressionCodec = snappy
agent.sinks.defaultSink.hdfs.path = s3n://testS3Bucket/%Y/%m/%d
agent.sinks.defaultSink.hdfs.filePrefix = events
agent.sinks.defaultSink.hdfs.rollInterval = 3600
agent.sinks.defaultSink.hdfs.rollCount = 0
agent.sinks.defaultSink.hdfs.rollSize = 262144000
agent.sinks.defaultSink.hdfs.batchSize = 10000
agent.sinks.defaultSink.hdfs.useLocalTimeStamp = true

#### CHANNELS ####

agent.channels.fileChannel.type = file
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000

I assume I'm just doing something wrong. Any ideas?

TheRueger
  • It seems that the channel is not closing the transaction for events that have not yet been transferred to HDFS. I'm currently researching in this direction. – TheRueger Mar 22 '16 at 16:46

1 Answer


After some investigation I found one of the main problems with using S3 with Flume and the HDFS sink.

One of the main differences between plain HDFS and the S3 implementation is that S3 does not directly support rename. When a file is renamed in S3, it is copied to the new name and the old file is deleted. (See: How to rename files and folder in Amazon S3?)

By default, Flume appends a .tmp suffix to files that are still being written. After rotation, the file is renamed to its final filename. In HDFS this is no problem, but on S3 this rename can cause problems, as described in this issue: https://issues.apache.org/jira/browse/FLUME-2445
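
As a side note, the HDFS sink has hdfs.inUsePrefix and hdfs.inUseSuffix properties that control this temporary-file naming (the property names are documented in the Flume user guide). A sketch of clearing them would look like the lines below, but I have not verified that this actually avoids the rename on S3, and the behaviour may depend on the Flume version:

# Sketch only: clear the in-use prefix/suffix so no ".tmp" name is used while writing.
# Not verified against S3; whether this skips the rename depends on the Flume version.
agent.sinks.defaultSink.hdfs.inUsePrefix =
agent.sinks.defaultSink.hdfs.inUseSuffix =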

Because the HDFS sink with S3 does not seem 100% trustworthy, I prefer the safer approach of writing all files locally and syncing/deleting the finished files with the AWS CLI tool s3 sync (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html).
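
A sketch of that sync step (the local directory /var/flume/events and the target prefix are just placeholders; adjust them to your setup):

# Sketch: local path and bucket/prefix are placeholders.
# --exclude "*.tmp" skips files that Flume is still writing to.
aws s3 sync /var/flume/events s3://testS3Bucket/events --exclude "*.tmp"

Note that s3 sync only copies files; deleting the already-synced local files is a separate step.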

In the worst case, files are not synced or the local disk fills up, but both problems can easily be detected by a monitoring system that should be in place anyway.

TheRueger