I understand the differences between the two, but I still seem to use KTable as a "default" without really knowing when to prefer a GlobalKTable.

Please share your experience: when is a GlobalKTable a must, and when should it be avoided?

Aaron_ab

1 Answer

The key difference is that a KTable is partitioned: if the underlying topic has N partitions, the instance that handles a subset of those partitions has access to the data on them, but not to the data on the partitions it is not managing.
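
To make the partitioned view concrete, here is a minimal toy model in plain Java. It is not the Kafka Streams API: `partitionFor` is a hypothetical stand-in for the default partitioner, and the class name, the country data and the partition assignments are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of a partitioned table; this is not the Kafka Streams API. */
final class PartitionedTableSketch {

    // Hypothetical stand-in for the default partitioner
    // (Kafka itself hashes the serialized key with murmur2).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Returns the rows one instance can see, given the partitions assigned to it. */
    static Map<String, String> localView(Map<String, String> topic,
                                         int numPartitions,
                                         Set<Integer> ownedPartitions) {
        Map<String, String> view = new HashMap<>();
        for (Map.Entry<String, String> row : topic.entrySet()) {
            if (ownedPartitions.contains(partitionFor(row.getKey(), numPartitions))) {
                view.put(row.getKey(), row.getValue());
            }
        }
        return view;
    }

    public static void main(String[] args) {
        Map<String, String> countries = Map.of(
                "ES", "Spain", "DE", "Germany", "FR", "France", "IT", "Italy");

        // KTable-like: two instances split the four partitions between them.
        System.out.println("instance A: " + localView(countries, 4, Set.of(0, 1)).keySet());
        System.out.println("instance B: " + localView(countries, 4, Set.of(2, 3)).keySet());

        // GlobalKTable-like: every instance reads all partitions.
        System.out.println("global:     " + localView(countries, 4, Set.of(0, 1, 2, 3)).keySet());
    }
}
```

Whatever the hash values turn out to be, the two partitioned views never overlap and together cover the topic, while the "global" view always equals the full topic.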

By contrast, a GlobalKTable loads all of the topic's data into every instance. For example, you'd want to use it for a join with a set of external data whose partitioning is not directly linked to that of the incoming data (or whose relation to it cannot be predicted).

E.g. say you have a stream from a users topic, with default round-robin partitioning, whose records have a country field, and you need to enrich that users stream with data about each user's country. Then you can use a GlobalKTable holding the country data and join the users stream with that country GlobalKTable on the country.
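
The shape of that join can be sketched in plain Java. This is a toy model, not the Kafka Streams API: in the real DSL the equivalent call is `KStream#join(GlobalKTable, KeyValueMapper, ValueJoiner)`, and here `keySelector` and `joiner` merely play those two roles. The `"name,countryCode"` value format and all of the sample data are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Toy model of a stream-to-global-table inner join; this is not the Kafka Streams API. */
final class GlobalTableJoinSketch {

    /**
     * Joins each stream record against a fully replicated lookup table.
     * keySelector derives the table key from the record, joiner merges the two values.
     */
    static <K, V, GK, GV, R> List<R> joinWithGlobal(
            List<Map.Entry<K, V>> stream,
            Map<GK, GV> globalTable,
            BiFunction<K, V, GK> keySelector,
            BiFunction<V, GV, R> joiner) {
        List<R> joined = new ArrayList<>();
        for (Map.Entry<K, V> record : stream) {
            GK lookupKey = keySelector.apply(record.getKey(), record.getValue());
            GV match = globalTable.get(lookupKey);
            if (match != null) { // inner join: records without a match are dropped
                joined.add(joiner.apply(record.getValue(), match));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical data: the stream key is a user id and says nothing about the country.
        List<Map.Entry<String, String>> users = List.of(
                Map.entry("u1", "alice,ES"),
                Map.entry("u2", "bob,DE"),
                Map.entry("u3", "carol,XX")); // unknown country code, so no match
        Map<String, String> countries = Map.of("ES", "Spain", "DE", "Germany");

        List<String> enriched = joinWithGlobal(
                users, countries,
                (userId, value) -> value.split(",")[1],                              // pick the join key
                (value, country) -> value.split(",")[0] + " lives in " + country);  // merge values
        System.out.println(enriched); // [alice lives in Spain, bob lives in Germany]
    }
}
```

Because the lookup table is complete on every instance, the join key can be taken from the record's value and the stream's own key never has to line up with the table's partitioning.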

Since a GlobalKTable gives every instance access to all of the joinable data, it is more efficient than a KTable for small data sets, because the stream does not need to be repartitioned for the join (all of the data is already there). But be aware of the size: every instance has to hold the complete data set. This is why it is normally used for limited-size data collections rather than very large ones.

If you join a KStream with a KTable on a key other than the stream's current key, the stream must first be re-keyed, and Kafka Streams then repartitions it (creating an internal topic) so that the records are regrouped by the join key.
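
Why the re-keying matters can be shown with another toy model in plain Java (again not the Kafka Streams API; in the DSL the re-keying step would be something like `selectKey` before the join). `partitionFor` is a hypothetical stand-in for the default partitioner, and the sample records are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy model of co-partitioning for a stream-table join; this is not the Kafka Streams API. */
final class RepartitionSketch {

    // Hypothetical stand-in for the default partitioner
    // (Kafka itself hashes the serialized key with murmur2).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /**
     * Counts records that sit in the same partition as the table row they must join with.
     * Each record is (current stream key, join key), and the table is keyed by the join key.
     */
    static long coLocated(List<Map.Entry<String, String>> stream, int numPartitions) {
        return stream.stream()
                .filter(r -> partitionFor(r.getKey(), numPartitions)
                        == partitionFor(r.getValue(), numPartitions))
                .count();
    }

    /** Re-keys every record by its join key; writing it back out is the repartition step. */
    static List<Map.Entry<String, String>> rekeyByJoinKey(List<Map.Entry<String, String>> stream) {
        return stream.stream()
                .map(r -> Map.entry(r.getValue(), r.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical records: keyed by user id, to be joined on country code.
        List<Map.Entry<String, String>> users = List.of(
                Map.entry("u1", "ES"), Map.entry("u2", "DE"), Map.entry("u3", "ES"));

        System.out.println("co-located before re-keying: " + coLocated(users, 4));
        System.out.println("co-located after re-keying:  " + coLocated(rekeyByJoinKey(users), 4));
    }
}
```

Before re-keying, whether a record happens to share a partition with its table row is down to chance; after re-keying, every record does, which is what the partitioned join relies on. With a single partition everything is trivially co-located.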

Similarly, if you are using the Processor API and query a KTable's state from an instance, you only get the data for the partitions assigned to that instance, not the data held by the other instances.

UPDATE: Also see @matthias-j-sax's comment on synchronization.

xmar
  • What if I have 1 partition? I updated the title of the question to reflect the usage of 1 partition – Aaron_ab Dec 17 '18 at 20:00
  • There is also a semantic difference for joins: using a KTable, the table side is synchronized with the KStream side based on timestamps. Using a GlobalKTable, no synchronization happens and the join is non-deterministic with regard to table-side updates. – Matthias J. Sax Dec 18 '18 at 06:59
  • If I have events going into a topic using the user_id as key, and I stream that topic out and join it with a KTable that's produced from another topic that also uses the user_id as key, then I shouldn't need to worry about using GlobalKTables? – Daniel Cull Jan 22 '19 at 15:49
  • Yes (assuming default partitioning). But it's more about the use case: the nature (and size) of the data you join with, and the key the join is made on. In some cases a GKT will be the best, easiest solution. – xmar Jan 23 '19 at 16:25