I am developing a Spark-Kafka streaming program in which I need to capture the Kafka partition offsets in order to handle failure scenarios.

Most developers use HBase to store offsets, but what if I used a file on HDFS or local disk instead, which seems simple and easy? I am trying to avoid using a NoSQL database for storing offsets.

What are the advantages and disadvantages of using a file rather than HBase for storing offsets?
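
To make the file option concrete, here is a minimal sketch of what a file-based offset store involves. It assumes a single writer and a local filesystem; the names `save_offsets` and `load_offsets` and the `"topic-partition"` key format are made up for illustration, and on HDFS the same write-then-rename pattern would have to go through an HDFS client rather than `os.replace`.

```python
import json
import os
import tempfile


def save_offsets(path, offsets):
    """Atomically replace the offsets file with a new snapshot.

    `offsets` is a dict such as {"events-0": 42} (hypothetical key format).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".offsets-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(offsets, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_offsets(path):
    """Return the last saved snapshot, or an empty dict on first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

Even in this small form, every update rewrites the whole file, concurrent writers would need extra locking, and a local file lives on a single disk.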

codejitsu
AKC
  • Well... What if the hard drive where that file exists fails? HBase runs on HDFS, so it really doesn't matter if you already have HBase set up. Why don't you *store the offsets in Kafka*? Or ZooKeeper? https://stackoverflow.com/questions/45686885/how-does-kafka-store-offsets-for-each-topic – OneCricketeer Mar 03 '18 at 02:33

2 Answers

Just use Kafka. Out of the box, Apache Kafka stores consumer offsets within Kafka itself.
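
As a rough illustration of letting Kafka hold the offsets, the sketch below commits only after a batch has been processed. It uses the plain `kafka-python` client rather than the Spark integration (in Spark's Kafka 0-10 direct stream the corresponding call is `commitAsync` on `CanCommitOffsets`, from Scala or Java). The function names, the `max_batches` parameter, and the topic, server, and group values passed in are all placeholders.

```python
def consume_and_commit(consumer, handle, poll_timeout_ms=1000, max_batches=None):
    """Process records batch by batch; commit to Kafka only after handling.

    `consumer` is expected to behave like kafka-python's KafkaConsumer
    (poll() returns {partition: [records]}, commit() commits consumed offsets).
    """
    batches = 0
    while max_batches is None or batches < max_batches:
        records_by_partition = consumer.poll(timeout_ms=poll_timeout_ms)
        for records in records_by_partition.values():
            for record in records:
                handle(record)  # if this raises, nothing is committed
        if records_by_partition:
            # Offsets are stored in Kafka itself, keyed by the consumer group.
            consumer.commit()
        batches += 1


def build_consumer(topic, servers, group_id):
    """Create a consumer with auto-commit off (requires `pip install kafka-python`)."""
    from kafka import KafkaConsumer

    return KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        group_id=group_id,
        enable_auto_commit=False,      # commit explicitly, after processing
        auto_offset_reset="earliest",  # where to start when no offset is stored yet
    )
```

Because the commit happens after processing, a crash between the two replays the last batch on restart, so the handler should tolerate duplicates.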

Robin Moffatt
  • could you please add advantages or disadvantages of using Kafka as storage for offsets? – Saeed Mohtasham Apr 22 '18 at 10:41
  • I'll put the question back on you. By default, and as designed by the Kafka project, it uses Kafka to manage offsets. What is your reason for wanting to deviate from this? – Robin Moffatt Apr 22 '18 at 16:16

I have a similar use case, and I prefer HBase for the following reasons:

  1. Easy retrieval: HBase stores rows sorted by row key, which helps when the offsets belong to different data groups.

  2. I had to capture the start and end offsets for a group of data. Capturing the start offset is easy, but the end offset is tough to capture in streaming mode, and I didn't want to open a file, update only the end offset, and close it each time. I also considered S3, but S3 objects are immutable.

ZooKeeper is another option. Hope this helps.

Bishnu