
We'd like to run Kiba as a batch process on a series of files. What would be the best structure to give a file mask, download the files from FTP, and then run the ETL job on each, sending a success or failure notification on a per-file basis?

Is there a way to do this from within Kiba, or is the best practice to handle all the non-ETL stuff externally and then call kiba on each file?

Steve Wetzel

1 Answer


I would initially start with the simplest possible thing, which is, like you said, handling the files externally and then calling Kiba on each one. For example:

  • Build a rake task to download the files locally (and remove them from the FTP server, or at least move them to a separate folder, to avoid double-processing), inside a well-known folder which will act as an inbox. See here for interesting links on how to do that.
  • Then build another rake task to iterate over the inbox folder and process each file (using Dir[pattern].each); see the Rakefile sketch just below.
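A minimal Rakefile sketch along those lines, to make the shape concrete. The host, credentials, folder layout, and file mask are placeholder assumptions, and Net::FTP here stands in for whatever download mechanism you prefer:

require "fileutils"
require "net/ftp"

INBOX   = "data/inbox"   # hypothetical inbox folder
ARCHIVE = "data/archive" # processed files get moved here
MASK    = "*.csv"        # hypothetical file mask

desc "Download pending files from the FTP server into the inbox"
task :download do
  Net::FTP.open("ftp.example.com") do |ftp|
    ftp.login("user", "secret")
    ftp.chdir("/outgoing")
    # NLST with a mask; whether the mask is honoured is server-dependent
    ftp.nlst(MASK).each do |name|
      ftp.getbinaryfile(name, File.join(INBOX, name))
      ftp.delete(name) # remove the remote copy to avoid double-processing
    end
  end
end

desc "Run the ETL job once per file sitting in the inbox"
task :process do
  Dir[File.join(INBOX, MASK)].sort.each do |file|
    # pass the current filename to the ETL script via the environment
    if system({ "ETL_FILE" => file }, "bundle exec kiba etl/import.etl")
      FileUtils.mv(file, ARCHIVE) # keep processed files out of the inbox
    else
      warn "Processing failed for #{file}"
    end
  end
end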

Make sure to use a helper such as:

# raise if a shell command exits with a non-zero status
def system!(command)
  fail "Command #{command} failed" unless system(command)
end

so that you reliably detect failures when making system calls (plain system returns false on a non-zero exit status instead of raising).

For your ETL file itself, you would use one at_exit block to capture failure and notify accordingly (see example here with Bugsnag), and a post_process block to capture success and notify in that case.
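For illustration, here is a sketch of how that scaffolding might look in an .etl file. notify_failure and notify_success are hypothetical stand-ins for your notifier (Bugsnag, email, ...), and MyCsvSource is a made-up source class:

# import.etl - notification scaffolding around the actual job

at_exit do
  # $! holds the exception (if any) that is terminating the script;
  # note that a plain `exit` sets it to SystemExit, which is not a
  # real failure - see the comment thread below on filtering it out
  if $! && !$!.is_a?(SystemExit)
    notify_failure(file: ENV["ETL_FILE"], error: $!)
  end
end

source MyCsvSource, filename: ENV["ETL_FILE"]

# ... transforms and destination go here ...

post_process do
  # post_process blocks only run if the whole job succeeded
  notify_success(file: ENV["ETL_FILE"])
end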

This will definitely work and is simple. That said, there are other possibilities, such as a single ETL file which downloads the files in a pre_process block, then a source which yields one filename per downloaded file, and maybe a transform which itself calls kiba on the command line (roughly sketched below), or even more advanced solutions.
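Roughly, that alternative shape could look like this. The source class and the download helper are made-up names for this sketch, not part of Kiba:

# all_in_one.etl - download, then yield one row per downloaded file

class DownloadedFilesSource
  def initialize(pattern:)
    @pattern = pattern
  end

  # Kiba sources implement each, yielding one row at a time
  def each
    Dir[@pattern].sort.each { |filename| yield(filename: filename) }
  end
end

pre_process do
  fetch_files_from_ftp("data/inbox") # hypothetical download helper
end

source DownloadedFilesSource, pattern: "data/inbox/*.csv"

transform do |row|
  # shell out to kiba for the per-file job, failing loudly on error
  ok = system({ "ETL_FILE" => row[:filename] }, "bundle exec kiba etl/import.etl")
  raise "Processing failed for #{row[:filename]}" unless ok
  row
end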

I would stick to the simplest possible solution to get started, as always!

Thibaut Barrère
  • Thanks, at_exit looks good. Is there any way to create something like a base class with the exit code in it, so we don't have to repeat it in every .etl file? – Steve Wetzel Sep 12 '16 at 17:53
  • 1
    Typically, projects with many .etl files will indeed require a central "common.rb" boilerplate, which can either directly call at_exit, or provide a "setup" method which will itself calls at_exit. This works fairly well at a decent scale. For further modularisation you can create dedicated modules or DSL extensions, and require accordingly etc on a per-etl basis. Let me know if this is clear! – Thibaut Barrère Sep 12 '16 at 18:27
  • Awesome, we were already using common.rb, but didn't realize you could put your at_exit code inside there. That works well, thanks. – Steve Wetzel Sep 12 '16 at 18:41
  • 1
    Yup! It's just regular Ruby (despite the way Kiba works for the declarative syntax) - check out this article for more http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code#a-different-way-to-structure-data-processing-code; finally, you may want to filter out the SystemExit exception (see http://www.bigfastblog.com/ruby-exit-exit-systemexit-and-at_exit-blunder). – Thibaut Barrère Sep 12 '16 at 18:46
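Putting the thread together, a common.rb along these lines is one way to do it. setup_notifications and notify_failure are hypothetical names:

# common.rb - shared boilerplate required by every .etl file

def setup_notifications(job_name)
  at_exit do
    # `exit` raises SystemExit, which is not a failure - only notify
    # on real exceptions (see the bigfastblog link above)
    if $! && !$!.is_a?(SystemExit)
      notify_failure(job: job_name, error: $!) # hypothetical notifier
    end
  end
end

Each .etl file then just does:

require_relative "common"
setup_notifications("daily-import") # hypothetical job name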