
I'm running a Kiba ETL pipeline in a Rails background job. I'd like to show the user some status while the job is running. What would be the best way to achieve this?

Can I use some variable somehow?

Or should I save a status update to the database after every step (once in the source, once for every transform, once in the destination)? Once per transform seems like a lot of additional database writes, and it also seems a bit "dirty" to talk to the database from a transform.

Thanks!

Viktor

1 Answer


To implement that type of use-case, you have to incorporate some form of progress tracking in your job.

It could report to a database record that models the job (recommended if you are doing fairly heavyweight imports and want to be able to search them afterwards), but you can also report to some form of pub-sub system (Redis, Postgres, ActionCable...) if you want something more instant and more lightweight.

A transform is actually a great place to track progress, but this does not mean you have to report on every single row (that would cause one SQL write per row, which is usually too much!).

What I recommend is to report the progress only every N rows, using code like this:

pre_process do
  @count ||= 0
end

transform do |r|
  @count += 1
  if @count % 500 == 0
    # TODO here: notify the report system
  end
  r
end

You will want to think about what happens if an error occurs while you are notifying the report system: maybe you want to halt everything, or maybe you want to continue.
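As a rough sketch of how the every-N reporting and the halt-or-continue decision could fit together, here is a class-based variant (Kiba calls `process(row)` on class-based transforms). The `ProgressReporter` class, its `every:` and `on_error:` options, and the `notifier` callable are hypothetical names chosen for illustration, not part of Kiba; the notifier stands in for whatever updates your database record or publishes to your pub-sub channel.

```ruby
# Hypothetical pass-through transform: counts rows and notifies every N rows.
class ProgressReporter
  attr_reader :count

  # notifier: any object responding to #call(count), e.g. a lambda that
  #           updates a job record or publishes to a pub-sub channel.
  # on_error: :continue logs notifier failures and keeps going,
  #           :halt re-raises them so the whole job stops.
  def initialize(notifier:, every: 500, on_error: :continue)
    @notifier = notifier
    @every = every
    @on_error = on_error
    @count = 0
  end

  def process(row)
    @count += 1
    notify if (@count % @every).zero?
    row # always pass the row through unchanged
  end

  private

  def notify
    @notifier.call(@count)
  rescue StandardError => e
    raise if @on_error == :halt
    warn "progress notification failed: #{e.message}"
  end
end

# Standalone usage, outside of any Kiba job:
updates = []
reporter = ProgressReporter.new(notifier: ->(n) { updates << n }, every: 2)
rows = [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }]
processed = rows.map { |row| reporter.process(row) }
```

Because the class only counts and forwards rows, a single instance placed right after the source (or right before the destination) should be enough; the other transforms do not need to know about it.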

Also make sure to track the beginning and the end of the job (success, error, completeness), so that you don't end up with jobs that look like they are still running when they are not.
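A minimal sketch of that start/end tracking, assuming a wrapper around whatever runs the pipeline. `JobStatus` and `with_status_tracking` are hypothetical helpers, and the in-memory event list is only a stand-in for a real database record or pub-sub channel.

```ruby
# Hypothetical in-memory stand-in for a job status record.
class JobStatus
  attr_reader :events

  def initialize
    @events = []
  end

  def record(state, details = {})
    @events << { state: state, at: Time.now }.merge(details)
  end

  # Most recent state, or nil if nothing was recorded yet.
  def state
    @events.last && @events.last[:state]
  end
end

# Wraps a block of work so that a start and a terminal state are always recorded.
def with_status_tracking(status)
  status.record(:started)
  result = yield
  status.record(:succeeded)
  result
rescue StandardError => e
  status.record(:failed, error: e.message)
  raise # let the background job framework see the failure too
end

ok = JobStatus.new
with_status_tracking(ok) { :done }

broken = JobStatus.new
begin
  with_status_tracking(broken) { raise ArgumentError, "bad row" }
rescue ArgumentError
  # rescued here only to keep the example running
end
```

In the background job, the block passed to `with_status_tracking` would be where the pipeline itself runs (for example a `Kiba.run` call), so a crash in any step still leaves a `:failed` entry instead of a job that looks stuck.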

It seems a bit "dirty" to talk to the database, but only because we are mixing concerns a bit. If you do it every N rows & make sure not to pollute the main system, it's perfectly fine!

halfer
Thibaut Barrère
  • Thank you for the detailed answer! I'm using an array of transformer classes. How would I pass the count in that case? I guess I could just pass the count variable to each transformer and they would all do their own % 500 check. – Viktor Apr 05 '20 at 09:46
  • I would only count right after the source or right before the destination, to avoid exactly that. Percentage of read and percentage of write are most of the time good enough! Do not clutter each of your transforms with this! – Thibaut Barrère Apr 05 '20 at 11:12
  • I'd really like to show output on every step, or as close to that as possible. My transforms take ~1s each because I'm importing some images, and I don't want to move that to the destination/load phase. Would you recommend that I manually call each of the transform classes within the `transform` block? That seems like the cleanest solution to me. What do you think? – Viktor Apr 06 '20 at 18:02
  • I'll reach out privately to get more information and answer properly, because it's a bit hard for me to correctly advise you here! – Thibaut Barrère Apr 07 '20 at 09:36