
We've got a case where we'd like to connect to HDFS and log a message to a Kafka topic on changes. Essentially, change data capture on HDFS. I know it's unusual to capture changes from HDFS, and unfortunately that makes searching difficult. We don't have access to the sources that are feeding into HDFS, so change data capture on HDFS seems like our only real option.

I don't need to read the files themselves. Being able to put a message onto a topic with the full path to the file and a little other minor information would be sufficient. I would, however, need to handle Kerberos authentication for HDFS.
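To make the requirement concrete, here's a minimal sketch of the kind of message I have in mind. The field names are my own invention, not any connector's schema:

```python
import json
import time

def build_change_event(path, event_type, size_bytes=None):
    """Build a minimal JSON payload describing a file change in HDFS."""
    payload = {
        "path": path,                     # full HDFS path to the file
        "event": event_type,              # e.g. "CREATE", "APPEND", "RENAME"
        "observed_at": int(time.time()),  # epoch seconds when the change was seen
    }
    if size_bytes is not None:
        payload["size_bytes"] = size_bytes
    return json.dumps(payload)

# The string returned here is what we'd hand to a Kafka producer as the value.
msg = build_change_event("hdfs://namenode:8020/data/in/part-0001.avro", "CREATE", 1048576)
```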

It looks like Confluent has an HDFS2SourceConnector and an HDFS3SourceConnector. Sadly, these pieces are not open source, and it's been difficult to understand their documentation. They seem to depend on the file system structure written by the HDFS2SinkConnector and HDFS3SinkConnector. The license is not an issue if these would work for this purpose. I've been trying to get something working here, but without luck. It's not clear what events they trigger on or where/how they write to a topic.

I've also stumbled across this https://github.com/mmolimar/kafka-connect-fs but it hasn't been updated in a while, appears to require an implementation of a FileReader, and I don't see out-of-the-box support for Kerberos. I could probably modify it to suit our use case.

Are there other alternatives out there or better documentation or examples for the Confluent plugins?

rschlachter
  • Can you not change the source to write to some system *other than HDFS* that does support CDC? – OneCricketeer Jan 20 '20 at 20:14
  • Personally, I would opt for Spark if you want to watch HDFS files and forward that event to Kafka. https://stackoverflow.com/questions/44375980/how-to-process-new-files-in-hdfs-directory-once-their-writing-has-eventually-fin Otherwise, wrap this in your own Kafka Source Connector - https://stackoverflow.com/questions/29960186/hdfs-file-watcher – OneCricketeer Jan 20 '20 at 20:17
  • @FatemaSagar Thanks for the link to the actual HDFS2 docs. Do you know of a better example than the docs? Also, it looks like that connector wants to read the file, and I really just need to know that a file changed, and not actually read it in any way. – rschlachter Jan 22 '20 at 19:04

1 Answer


Sounds like you want this

https://kafka-connect-fs.readthedocs.io/en/latest/connector.html#hdfs-file-watcher
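For reference, a minimal connector config for that policy might look like the following. The property names are taken from the kafka-connect-fs docs as I understand them; the `policy.fs.*` Kerberos-related line in particular is an assumption you'd need to verify against your version:

```properties
name=hdfs-file-watcher
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=1
fs.uris=hdfs://namenode:8020/data/in
topic=hdfs-changes
policy.class=com.github.mmolimar.kafka.connect.fs.policy.HdfsFileWatcherPolicy
# You said you don't need file contents, but a reader still has to be configured
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
# Hadoop client settings can be passed through to the underlying filesystem
# (assumption -- check the docs for the exact prefix your version supports)
policy.fs.hadoop.security.authentication=kerberos
```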

hasn't been updated in a while

A lack of recent commits can indicate project stability rather than abandonment. You're welcome to open GitHub issues and see if you get responses. Otherwise, you are seemingly locked into Confluent/community support.

better documentation or examples for the Confluent plugins

You can send feedback to the docs team at docs@confluent.io (subject: Documentation Feedback).


IMO, HDFS is primarily designed for a write-once, read-many architecture, so I would advise moving your data lake storage to something like S3, where you can trigger Lambda actions on object events.
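In that S3 direction, the Lambda side is straightforward: the handler just pulls the bucket/key out of the S3 event notification and builds the same kind of "file changed" message you wanted on the topic. Producing to Kafka (or MSK) from the handler is left out here; this only sketches the event shape, which follows the documented S3 notification structure:

```python
import json

def handler(event, context=None):
    """Turn S3 event notification records into 'file changed' JSON messages."""
    messages = []
    for rec in event.get("Records", []):
        s3 = rec.get("s3", {})
        messages.append(json.dumps({
            "bucket": s3.get("bucket", {}).get("name"),
            "key": s3.get("object", {}).get("key"),
            "event": rec.get("eventName"),  # e.g. "ObjectCreated:Put"
        }))
    # In a real deployment you'd produce each message to Kafka here.
    return messages
```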

OneCricketeer