It would be great if there were a way to get some kind of return object from a Kiba ETL run, so that I could use the data in it to build a report on how well the pipeline ran.
We have a job that runs every 10 minutes and processes 20-50k records on average, condensing them into summary records, some of which are created and some of which are updated. The problem is that it's difficult to know what happened without trawling through reams of log files, and obviously logs aren't useful to end users either.
Is there a way to populate some kind of outcome object with arbitrary data as the pipeline runs? e.g.
- 25.7k rows found in source
- 782 records dropped by this transformer
- 100 records inserted
- 150 records updated
- 20 records had errors (and here they are)
- This record had the highest statistic
- 1200 records belonged to this VIP customer
- etc.
And then at the end, use that data to send an email summary, populate a web page, render some console output, etc.
The only way I can see this working right now is to pass an object in during setup and mutate it as rows flow through the sources, transformers, and destinations. Once the run is complete, check that object and do something with the data that's now in it.
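To illustrate what I mean, here's a minimal sketch of that approach. All the class names (RunStats, ArraySource, FilterInvalidRows, CountingDestination) are things I'd write myself, not part of Kiba; only Kiba.parse / Kiba.run are the actual Kiba API:

```ruby
require 'kiba'

# Mutable stats object shared by the whole run.
class RunStats
  attr_accessor :rows_read, :rows_dropped, :rows_written

  def initialize
    @rows_read = @rows_dropped = @rows_written = 0
  end
end

# Trivial in-memory source, just for the example.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

# Transform that drops rows missing :amount and records what it did.
class FilterInvalidRows
  def initialize(stats)
    @stats = stats
  end

  def process(row)
    @stats.rows_read += 1
    if row[:amount].nil?
      @stats.rows_dropped += 1
      return nil # returning nil removes the row from the pipeline
    end
    row
  end
end

# Destination that just counts writes (a real one would insert/update).
class CountingDestination
  def initialize(stats)
    @stats = stats
  end

  def write(row)
    @stats.rows_written += 1
  end
end

stats = RunStats.new

job = Kiba.parse do
  source ArraySource, [{ amount: 10 }, { amount: nil }, { amount: 5 }]
  transform FilterInvalidRows, stats
  destination CountingDestination, stats
end

Kiba.run(job)

# The caller, not the pipeline, decides how to report the outcome:
puts "read=#{stats.rows_read} dropped=#{stats.rows_dropped} written=#{stats.rows_written}"
```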
Is this how it should be done, or is there a better way?
EDIT
Just want to add that I don't want to handle this in the post_process block, because the pipeline gets used through a number of different mediums, and I'd want each use case to handle its own feedback mechanism. It's also cleaner (imo) for an ETL pipeline not to have to worry about where it's used and what that usage scenario's feedback expectations are...