How to transform nested JSON-payloads with Kiba-ETL?

Question

I want to transform nested JSON-payloads into relational tables with Kiba-ETL. Here's a simplified pseudo-JSON-payload:

{
  "bookings": [
    {
      "bookingNumber": "1111",
      "name": "Booking 1111",
      "services": [
        {
          "serviceNumber": "45",
          "serviceName": "Extra Service"
        }
      ]
    },
    {
      "bookingNumber": "2222",
      "name": "Booking 2222",
      "services": [
        {
          "serviceNumber": "1",
          "serviceName": "Super Service"
        },
        {
          "serviceNumber": "2",
          "serviceName": "Bonus Service"
        }
      ]
    }
  ]
}

How can I transform this payload into two tables:

bookings
services (every service belongsTo a booking)

I read a about yielding multiple rows with the help of Kiba::Common::Transforms::EnumerableExploder at wiki, blog, ... etc.

Would you solve my use-case by yielding multiple rows (the booking and multiple services), or would you implement a Destination which receives a whole booking and calls some Sub-Destinations (i.e. to create or update a service)?

score 2 · Accepted Answer · answered Jul 03 '19 at 16:31

Author of Kiba here!

This is a common requirement, but it can (and this is not specific to Kiba) be more or less complex to handle. Here are a few points you'll need to think about.

Handling of foreign keys

The main problem here is that you'll want to keep the relationships between services and bookings, once they are inserted.

Foreign keys using business keys

A first (most easy) way to handle this is to use a foreign-key constraint on "booking number", and make sure to insert that booking number in each service row, so that you can leverage it later in your queries. If you do this (see https://stackoverflow.com/a/18435114/20302) you'll have to set a unique-constraint on "booking number" in the bookings table target.

Foreign keys using primary keys

If you instead prefer to have a booking_id which points to the bookings table id key, things are a bit more complicated.

If this is a one-off import targeting an empty table, I recommend that you arbitrarily force the primary key using something like:

transform do |r|
  @row_index ||= 0
  @row_index += 1
  r.merge(id: @row_index)
end

If this not a one-off import, you will have to: * Upsert bookings in a first pass * In a second pass, look-up (via SQL queries) "bookings" to figure out what is the id to store in booking_id, then upsert the services

As you see it's a bit more work, so stick with option 1 if you don't have strong requirements around this (although option 2 is more solid on the long run).

Example implementation (using Kiba Pro & business keys)

The simplest way to achieve this (assuming your target is Postgres) is to use Kiba Pro's SQL Bulk Insert/Upsert destination.

It would go this way (in single pass):

extend Kiba::DSLExtensions::Config
config :kiba, runner: Kiba::StreamingRunner

source Kiba::Common::Sources::Enumerable, -> { Dir["input/*.json"] }

transform { |r| JSON.parse(IO.read(r)).fetch('bookings') }

transform Kiba::Common::Transforms::EnumerableExploder

# SNIP (remapping / renaming of fields etc)

first_destination = nil

destination Kiba::Pro::Destinations::SQLBulkInsert,
  row_pre_processor: -> (row) { row.except("services") },
  dataset: -> (dataset) {
    dataset.insert_conflict(target: :booking_number)
  },
  after_read: -> (d) { first_destination = d }

destination Kiba::Pro::Destinations::SQLBulkInsert,
  row_pre_processor: -> (row) { row.fetch("services") },
  dataset: -> (dataset) {
    dataset.insert_conflict(target: :service_number)
  },
  before_flush: -> { first_destination.flush }

Here we iterate over each input file, parsing it and grabbing the "bookings", then generating one row per element of "bookings".

We have 2 destinations, doing "upsert" (insert or update), plus one trick to ensure we'll save the parent rows before we insert the children, to avoid a failure due to missing pointed record.

You can of course implement this yourself, but this is a bit of work though!

If you need to use primary-key based foreign keys, you'll have (likely) to split in 2 pass (one for each destination), then add some form of lookup in the middle.

Conclusion

I know that this is not trivial (depending on what you'll need, & if you'll use Kiba Pro or not), but at least I'm sharing the patterns that I'm using in such situations.

Hope it helps a bit!

thank you for the explanation and your thoughts. I have one further question: I cannot do one-off imports. I have to implement an incremental approach. If bookings are updated, they have to be reimported. Where do you delete existing bookings and their children? At the booking-destination? — wuarmin, Jul 03 '19 at 19:03