Modify a range of rows after applying transformations

Question

I want to write a Kiba transformation that lets me insert the same information into a specific number of rows. In this case I have an XLS file that contains subheaders, and these subheaders contain data as well, like this:

Client: John Doe, Code: 1234
qty, date, price
1, 12/12/2017, 300.00
6, 12/12/2017, 300.00
total: 2100
Client: Nick Leitgeb, Code: 2345
qty, date, price
1, 12/12/2017, 100.00
2, 12/12/2017, 200.00
2, 12/12/2017, 50.00
total: 600
Client: …..

To extract the relevant data I use the following transformation, which keeps the rows that match at least one of the two provided regexes (a date or the word ‘Client’):

transform SelectByRegexes, regex: [/\d+\/\d+\/\d+/, /Client:/], matches: 1
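
The SelectByRegexes class itself is not shown in the question. As a rough sketch only, assuming each row is an Array of cell strings and that `matches` is the minimum number of regexes a row must hit, such a class could look like the following. The implementation and the sample rows are assumptions made for illustration; only the class name and its two options come from the question.

```ruby
# Hypothetical implementation: the question does not show the real class.
class SelectByRegexes
  def initialize(regex:, matches: 1)
    @regexes = regex
    @matches = matches
  end

  # Kiba calls #process once per row; returning nil drops the row.
  def process(row)
    line = row.join(', ')
    hits = @regexes.count { |re| line.match?(re) }
    hits >= @matches ? row : nil
  end
end

# Driving the class by hand, outside of a Kiba job:
rows = [
  ['Client: John Doe', 'Code: 1234'],
  ['qty', 'date', 'price'],
  ['1', '12/12/2017', '300.00'],
  ['total: 2100']
]
selector = SelectByRegexes.new(regex: [/\d+\/\d+\/\d+/, /Client:/], matches: 1)
selected = rows.map { |row| selector.process(row) }.compact
```

Because the class is plain Ruby, it can be exercised like this in a unit test without running a full job.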

This gives me the following result:

Client: John Doe, Code: 1234
1, 12/12/2017, 300.00
6, 12/12/2017, 300.00
Client: Nick Leitgeb, Code: 2345
1, 12/12/2017, 100.00
2, 12/12/2017, 200.00
2, 12/12/2017, 50.00
…..

Now that I have the information I want, I need to replicate the client and the code on each sub-row and delete the subheader:

John Doe, 1234, 1, 12/12/2017, 300.00
John Doe, 1234, 6, 12/12/2017, 300.00
Nick Leitgeb, 2345, 1, 12/12/2017, 100.00
Nick Leitgeb, 2345, 2, 12/12/2017, 200.00
Nick Leitgeb, 2345, 2, 12/12/2017, 50.00

The only way I can think of to do this is directly in the source or in a pre_process block, but that would still need the earlier transformations in order to expose the necessary data. Is it possible to use a transformation class inside the source or pre_process block, or to manipulate multiple rows within a transformation?

Kiba 2 (out a few weeks ago) provides a much better way to handle that now: its "StreamingRunner", which allows a transform to yield multiple times. Check it out! https://github.com/thbar/kiba/releases/tag/v2.0.0 — Thibaut Barrère, Jan 18 '18 at 21:32
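
To illustrate the idea of yielding several rows from one transform, here is a minimal sketch. It assumes a hypothetical upstream step that hands over one Hash per client with `:client`, `:code` and `:lines` keys; that layout and the class name `ExplodeClientBlock` are invented for the example. The runner is simulated by passing a block to `process`, so the snippet runs without Kiba; see the linked release notes for how to enable the streaming runner in a real job.

```ruby
# Hypothetical transform: expects a Hash with :client, :code and :lines keys.
class ExplodeClientBlock
  def process(block)
    block.fetch(:lines).each do |line|
      # one output row per detail line, prefixed with the client data
      yield [block.fetch(:client), block.fetch(:code)] + line
    end
    nil # nothing left to return once every line has been yielded
  end
end

block = {
  client: 'Nick Leitgeb',
  code: '2345',
  lines: [['1', '12/12/2017', '100.00'], ['2', '12/12/2017', '200.00']]
}

# Stand-in for the runner: collect whatever the transform yields.
emitted = []
ExplodeClientBlock.new.process(block) { |row| emitted << row }
```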

score 3 · Accepted Answer · answered Feb 28 '17 at 17:26

Kiba author here! Thanks for using Kiba. You are right that you could achieve that from a specialized source, but I personally prefer to use the following pattern:

require 'logger'

last_seen_client_row = nil
logger = Logger.new(STDOUT)

transform do |row|
  # detect "Client/Code" rows - pseudo code, adjust as needed
  # (anchored at the start only, so "Client: John Doe" matches)
  if row[0] =~ /\AClient:/
    # this is a top-level header, memorize it
    last_seen_client_row = row
    logger.info "Client boundaries detected for client XXX"
    next # remove the row from pipeline
  else
    # assuming you are working with arrays (I usually prefer Hashes though!);
    # make sure to dupe the data to avoid sharing the memorized row between output rows
    last_seen_client_row.dup + row
  end
end

You could of course turn that block into a more testable class, and I recommend being very strict in your row detection, so that you notice any change in format and fail fast.
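
As one possible shape for such a class, here is a sketch under stated assumptions: rows are Arrays of cell strings, the header cells look exactly like "Client: ..." and "Code: ...", and the class name `PrependClientInfo` is made up for the example. It also strips the "Client:" and "Code:" labels so the output matches the format asked for in the question, and it raises on anything unexpected.

```ruby
# Hypothetical class name and row layout (Array of cell strings).
class PrependClientInfo
  CLIENT_CELL = /\AClient:\s*(.+)\z/
  CODE_CELL   = /\ACode:\s*(\S+)\z/

  def initialize
    @client_info = nil
  end

  def process(row)
    if (client = CLIENT_CELL.match(row[0].to_s.strip))
      code = CODE_CELL.match(row[1].to_s.strip)
      # strict detection: a client header without a code is a format change
      raise ArgumentError, "Malformed client header: #{row.inspect}" unless code
      @client_info = [client[1].strip, code[1]]
      nil # drop the subheader row from the pipeline
    else
      raise ArgumentError, "Detail row before any client header: #{row.inspect}" unless @client_info
      @client_info + row # Array#+ builds a new array, the memorized one is untouched
    end
  end
end

# Driving the class by hand with the rows from the question:
input = [
  ['Client: John Doe', 'Code: 1234'],
  ['1', '12/12/2017', '300.00'],
  ['6', '12/12/2017', '300.00'],
  ['Client: Nick Leitgeb', 'Code: 2345'],
  ['1', '12/12/2017', '100.00']
]
transform = PrependClientInfo.new
output = input.map { |row| transform.process(row) }.compact
```

In a Kiba job the class would be plugged in with `transform PrependClientInfo`, placed after the step that filters out the column headers and totals.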

Hope this helps!

Thanks! This helped a lot; I have used this strategy for processing multiple rows in other ETL scripts as well! — José Añasco, Mar 08 '17 at 14:49
You're welcome, glad to hear everything went fine! Please consider marking the answer as accepted to help out other viewers! — Thibaut Barrère, Mar 08 '17 at 16:21
