We've got a case where we'd like to connect to HDFS and log a message to a Kafka topic whenever something changes — essentially change data capture on HDFS. I know capturing changes from HDFS is an unusual case, and unfortunately that's making searching for solutions difficult. We don't have access to the sources that feed into HDFS, so change data capture on HDFS seems like our only real option.
I don't need to read the files themselves. Being able to put a message onto a topic with the full path to the file and a little other minor information would be sufficient. I would, however, need to handle Kerberos authentication for HDFS.
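To make the requirement concrete, here is a minimal sketch of the poll-and-diff fallback we could build ourselves if no connector fits. Only the pure change-detection and message-shaping steps are shown; the function names, the snapshot shape (full path mapped to modification time), and the JSON fields are all my own illustration. In practice the listings would come from HDFS (e.g. WebHDFS with SPNEGO/Kerberos) and the publish step from a Kafka producer, neither of which is shown here:

```python
import json

def diff_listings(previous, current):
    """Diff two directory snapshots, each mapping full HDFS path -> mtime (ms).

    Returns (created, modified, deleted) lists of paths. The snapshots
    themselves would be fetched from HDFS on each polling cycle.
    """
    created = sorted(p for p in current if p not in previous)
    modified = sorted(p for p in current
                      if p in previous and current[p] != previous[p])
    deleted = sorted(p for p in previous if p not in current)
    return created, modified, deleted

def build_message(event, path, mtime):
    """Kafka payload: just the full path and minor metadata, no file contents."""
    return json.dumps({"event": event, "path": path, "mtime": mtime})
```

Each polling cycle would take a fresh snapshot, diff it against the previous one, and produce one `build_message` payload per changed path. This is only a sketch of what we're after; a proper connector that does this (and handles Kerberos) would obviously be preferable to rolling our own.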
It looks like Confluent has an HDFS2SourceConnector and an HDFS3SourceConnector. Sadly, these are not open source, and their documentation has been difficult to follow. They seem to depend on the file system structure produced by the HDFS2SinkConnector and HDFS3SinkConnector. The license is not an issue if they would work for this purpose. I've been trying to get something working here, but without luck: it's not clear what events they trigger on or where/how they write to a topic.
I've also stumbled across https://github.com/mmolimar/kafka-connect-fs, but it hasn't been updated in a while, appears to require a FileReader implementation, and I don't see out-of-the-box support for Kerberos. I could probably modify it to suit our use case.
Are there other alternatives out there, or better documentation or examples for the Confluent plugins?