
I'm working on a project where we need to stream real-time updates from Oracle to a number of systems (Cassandra, Hadoop, real-time processing, etc.). We are planning to use GoldenGate to capture the changes from Oracle, write them to Kafka, and then let the different target systems read the events from Kafka. There are quite a few design decisions that need to be made:

What data to write into Kafka on updates?

GoldenGate emits updates in the form of a record ID and the updated fields. These changes can be written into Kafka in one of three ways:

  • Full rows: For every field change, emit the full row. This gives a complete representation of the 'object', but probably requires querying the source to get the full row.
  • Only updated fields: The easiest to produce, but awkward to work with, as you never have a full representation of an object easily accessible. How would one write this to Hadoop?
  • Events: Probably the cleanest format (and the best fit for Kafka), but it requires a lot of work to translate DB field updates into events.
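To make the trade-off concrete, here is a sketch (plain Python; the table, field names, and values are invented for illustration) of what a single UPDATE to one field could look like in each of the three formats:

```python
import json

# A hypothetical change captured by GoldenGate: record 42 in table
# CUSTOMERS had its EMAIL field updated. All names here are invented.

# 1. Full row: re-query the source and emit the whole record.
full_row = {
    "table": "CUSTOMERS",
    "op": "UPDATE",
    "record_id": 42,
    "row": {"id": 42, "name": "Ada", "email": "ada@example.com", "tier": "gold"},
}

# 2. Only updated fields: smallest payload, but consumers never see
# the whole object without merging deltas themselves.
delta = {
    "table": "CUSTOMERS",
    "op": "UPDATE",
    "record_id": 42,
    "changed": {"email": "ada@example.com"},
}

# 3. Event: the field change translated into a business-level event.
event = {
    "type": "CustomerEmailChanged",
    "customer_id": 42,
    "new_email": "ada@example.com",
}

for msg in (full_row, delta, event):
    print(json.dumps(msg))
```

Note how the event format hides the table structure entirely, which is why it is the cleanest to consume but the most work to produce.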

Where to perform data transformation and cleanup?

The schema in the Oracle DB is generated by a 3rd-party CRM tool and is hence not very easy to consume - there are weird field names, translation tables, etc. This data can be cleaned in one of three places: (a) the source system, (b) Kafka, using stream processing, (c) each target system.
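If the cleanup happens in option (b), the transformation is a pure function applied between the raw topic and a cleaned topic. A minimal sketch, where the CRM column names, the field map, and the tier-code translation table are all invented stand-ins:

```python
# Hypothetical cleanup step for one record, e.g. running inside a stream
# processor between the raw GoldenGate topic and a "cleaned" topic.
FIELD_MAP = {"X_CUST_NM_1": "name", "X_CUST_EML_2": "email"}  # invented
TIER_CODES = {"01": "gold", "02": "silver"}  # stand-in translation table

def clean(record: dict) -> dict:
    # Rename known weird columns, lowercase the rest.
    out = {FIELD_MAP.get(k, k.lower()): v for k, v in record.items()}
    # Resolve coded values via the translation table.
    if "tier_cd" in out:
        out["tier"] = TIER_CODES.get(out.pop("tier_cd"), "unknown")
    return out

raw = {"X_CUST_NM_1": "Ada", "X_CUST_EML_2": "ada@example.com", "TIER_CD": "01"}
print(clean(raw))  # {'name': 'Ada', 'email': 'ada@example.com', 'tier': 'gold'}
```

Doing this once in the pipeline, rather than in each target system, keeps the translation tables in one place.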

How to ensure in-order processing for parallel consumers?

Kafka allows each consumer to read a different partition, and each partition is guaranteed to be in order. Topics and partitions need to be picked in a way that guarantees that messages in each partition are completely independent. If we pick a topic per table and hash records to partitions based on record_id, this should work most of the time. However, what happens when a new child object is added? We need to make sure it gets processed before the parent uses its foreign key.
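The per-record ordering guarantee follows from deterministic key hashing. A sketch of the idea (the hash below is a stand-in for Kafka's key partitioner - the Java client's default uses murmur2 - and the partition count is assumed):

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count for the per-table topic

def partition_for(record_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's key hashing; what matters is determinism, so
    # every change for a given record_id lands in the same partition
    # and is therefore consumed in order.
    digest = hashlib.md5(record_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All updates to record "42" map to one fixed partition.
p = partition_for("42")
assert all(partition_for("42") == p for _ in range(5))
print(p)
```

Note that this only orders changes *within* one record_id: a child row in one table's topic and its parent in another can land in unrelated partitions, which is exactly the foreign-key race described above, so cross-entity ordering needs a separate mechanism (e.g. keying related records together, or handling the missing child at the consumer).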

EugeneMi

1 Answer


One solution I have implemented is to publish only the record id into Kafka and, in the consumer, use a lookup to the origin DB to get the complete record. In a scenario like the one described in the question, you may want to use the CRM tool's API to look up that particular record rather than reverse-engineer the record lookup in your code. How did you end up implementing the solution?
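A minimal sketch of this id-only pattern: the Kafka message carries just the record id, and the consumer fetches the full record at read time. Here `crm_lookup` is a hypothetical stand-in for the CRM API call, backed by a dict instead of a real system:

```python
# In-memory stand-in for the CRM / origin DB (hypothetical data).
CRM = {42: {"id": 42, "name": "Ada", "email": "ada@example.com"}}

def crm_lookup(record_id: int) -> dict:
    # In practice this would be a call to the CRM tool's API,
    # not a dict lookup.
    return CRM[record_id]

def handle_message(msg: dict) -> dict:
    # The Kafka message only names the record; fetch the full state.
    return crm_lookup(msg["record_id"])

print(handle_message({"table": "CUSTOMERS", "record_id": 42}))
```

One caveat with this approach: the lookup returns the record's *current* state, which may already include changes newer than the event being processed, so consumers see the latest version rather than a point-in-time snapshot.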

thcricketfan