
We'd like to run Kiba as a batch process on a series of files. What would be the best structure to give a file mask, download the files from FTP, and then run the ETL job on each, sending a success or failure notification on a per-file basis?

Is there a way to do this from within Kiba, or is the best practice to handle all the non-ETL stuff externally and then call kiba on each file?

Steve Wetzel

1 Answer


I would initially start with the simplest possible thing, which is, like you said, handling the files externally and then calling Kiba on each one. For example:

  • Build a rake task to download the files locally (and remove them from the FTP server, or at least move them to a separate folder, to avoid double-processing), inside a well-known folder which will act as an inbox. See here for interesting links on how to do that.
  • Then build another rake task to iterate over the inbox folder and process each file (using Dir[pattern].each); see the Rakefile sketch just below.
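A minimal Rakefile sketch along those lines, to make the shape concrete. The host, credentials, folder layout, and file mask are placeholder assumptions, and Net::FTP here stands in for whatever download mechanism you prefer:

require "fileutils"
require "net/ftp"

INBOX   = "data/inbox"   # hypothetical inbox folder
ARCHIVE = "data/archive" # processed files get moved here
MASK    = "*.csv"        # hypothetical file mask

desc "Download pending files from the FTP server into the inbox"
task :download do
  Net::FTP.open("ftp.example.com") do |ftp|
    ftp.login("user", "secret")
    ftp.chdir("/outgoing")
    # NLST with a mask; whether the mask is honoured is server-dependent
    ftp.nlst(MASK).each do |name|
      ftp.getbinaryfile(name, File.join(INBOX, name))
      ftp.delete(name) # remove the remote copy to avoid double-processing
    end
  end
end

desc "Run the ETL job once per file sitting in the inbox"
task :process do
  Dir[File.join(INBOX, MASK)].sort.each do |file|
    # pass the current filename to the ETL script via the environment
    if system({ "ETL_FILE" => file }, "bundle exec kiba etl/import.etl")
      FileUtils.mv(file, ARCHIVE) # keep processed files out of the inbox
    else
      warn "Processing failed for #{file}"
    end
  end
end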

Make sure to use a helper such as:

# raise if a shell command exits with a non-zero status
def system!(command)
  fail "Command #{command} failed" unless system(command)
end

so that you reliably detect failures when making system calls (plain system returns false on a non-zero exit status instead of raising).

For your ETL file itself, you would use one at_exit block to capture failure and notify accordingly (see example here with Bugsnag), and a post_process block to capture success and notify in that case.
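For illustration, here is a sketch of how that scaffolding might look in an .etl file. notify_failure and notify_success are hypothetical stand-ins for your notifier (Bugsnag, email, ...), and MyCsvSource is a made-up source class:

# import.etl - notification scaffolding around the actual job

at_exit do
  # $! holds the exception (if any) that is terminating the script;
  # note that a plain `exit` sets it to SystemExit, which is not a
  # real failure - see the comment thread below on filtering it out
  if $! && !$!.is_a?(SystemExit)
    notify_failure(file: ENV["ETL_FILE"], error: $!)
  end
end

source MyCsvSource, filename: ENV["ETL_FILE"]

# ... transforms and destination go here ...

post_process do
  # post_process blocks only run if the whole job succeeded
  notify_success(file: ENV["ETL_FILE"])
end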

This will definitely work and is simple. That said, there are other possibilities, such as a single ETL file which downloads the files in a pre_process block, then a source which yields one filename per downloaded file, and maybe a transform which itself calls kiba on the command line (roughly sketched below), or even more advanced solutions.
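Roughly, that alternative shape could look like this. The source class and the download helper are made-up names for this sketch, not part of Kiba:

# all_in_one.etl - download, then yield one row per downloaded file

class DownloadedFilesSource
  def initialize(pattern:)
    @pattern = pattern
  end

  # Kiba sources implement each, yielding one row at a time
  def each
    Dir[@pattern].sort.each { |filename| yield(filename: filename) }
  end
end

pre_process do
  fetch_files_from_ftp("data/inbox") # hypothetical download helper
end

source DownloadedFilesSource, pattern: "data/inbox/*.csv"

transform do |row|
  # shell out to kiba for the per-file job, failing loudly on error
  ok = system({ "ETL_FILE" => row[:filename] }, "bundle exec kiba etl/import.etl")
  raise "Processing failed for #{row[:filename]}" unless ok
  row
end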

I would stick to the simplest possible solution to get started, as always!

Thibaut Barrère
  • Thanks, at_exit looks good. Is there any way to create something like a base class with the exit code in it, so we don't have to repeat it in every .etl file? – Steve Wetzel Sep 12 '16 at 17:53
  • 1
    Typically, projects with many .etl files will indeed require a central "common.rb" boilerplate, which can either directly call at_exit, or provide a "setup" method which will itself calls at_exit. This works fairly well at a decent scale. For further modularisation you can create dedicated modules or DSL extensions, and require accordingly etc on a per-etl basis. Let me know if this is clear! – Thibaut Barrère Sep 12 '16 at 18:27
  • Awesome, we were already using common.rb, but didn't realize you could put your at_exit code inside there. That works well, thanks. – Steve Wetzel Sep 12 '16 at 18:41
  • 1
    Yup! It's just regular Ruby (despite the way Kiba works for the declarative syntax) - check out this article for more http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code#a-different-way-to-structure-data-processing-code; finally, you may want to filter out the SystemExit exception (see http://www.bigfastblog.com/ruby-exit-exit-systemexit-and-at_exit-blunder). – Thibaut Barrère Sep 12 '16 at 18:46
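Putting the thread together, a common.rb along these lines is one way to do it. setup_notifications and notify_failure are hypothetical names:

# common.rb - shared boilerplate required by every .etl file

def setup_notifications(job_name)
  at_exit do
    # `exit` raises SystemExit, which is not a failure - only notify
    # on real exceptions (see the bigfastblog link above)
    if $! && !$!.is_a?(SystemExit)
      notify_failure(job: job_name, error: $!) # hypothetical notifier
    end
  end
end

Each .etl file then just does:

require_relative "common"
setup_notifications("daily-import") # hypothetical job name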