CONTEXT

  • I'm new to Ruby and all that jazz, but I'm not new to dev.
  • I'm taking over a project based on 2 rails/puma repositories for web & APIs.
  • I'm building a new repository for a backend data processing app, using Kiba, that will run through scheduled jobs.
  • Also, I'm to be joined by other devs later on, so I'd like to make something maintainable by design.

MY QUESTION : Should I use Rails on that ETL project?

Using it means we can apply the same folder structure as the other repos, use RSpec all the same etc. It also appeared to me that Rails changes the way classes like Hash act.

At the same time, it seems to bring unnecessary complexity to a project that will run from the CLI and could consist of only a dozen files.

Thibaut Barrère
Tristan M
  • Rails has its pros and cons. It's a bunch of boilerplate, but it can also make development quicker. – max pleaner Mar 03 '20 at 09:09
  • Kiba author here! This question is actually not opinion-based. I would like to expand and make a factual response here! @spickermann any chance to see this reopened? Thanks! – Thibaut Barrère Mar 03 '20 at 09:21

2 Answers

Kiba author here! This is an important question, thanks for asking it!

MY QUESTION : Should I use Rails on that ETL project?

By default, I would recommend starting with a separate project (a kind of "macro-service" approach), unless you have important things (more than just RSpec & ENV setup) to reuse from the Rails app.

If you expect significant coupling between the app and the ETL (e.g. by "scheduled jobs" you mean jobs triggered through Sidekiq to react to events, or you have classes shared between the two projects), then you can place the ETL in an etl subfolder of your Rails app, for instance. This provides a bit of separation and leaves the opportunity to split the code out later if that becomes a better path (this is a middle ground I'm using on some projects).

If that is not the case, though, and the data pipeline is expected to become large and live its own life, you can instead split it into its own project.

Using it means we can apply the same folder structure as the other repos, use RSpec all the same etc.

You can use RSpec or minitest from a dedicated ETL (pure Ruby) project too, introduce a notion of ETL_ENV (development, test, production), build your own ENV-based (or file-based) configuration with dotenv or similar, and support cron jobs from there too if you need that.
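To make the ETL_ENV idea concrete, here is a minimal sketch of what such a configuration could look like in a pure Ruby project. The `EtlConfig` module and its method names are hypothetical (they are not part of Kiba or dotenv); the sketch only assumes that the settings end up in `ENV`, for instance after dotenv has loaded a `.env` file.

```ruby
# Hypothetical configuration module for a pure-Ruby ETL project.
# It reads everything from ENV (or any Hash-like object, which helps in tests).
module EtlConfig
  VALID_ENVS = %w[development test production].freeze

  # Returns the current environment name, defaulting to "development".
  def self.env(source = ENV)
    name = source.fetch("ETL_ENV", "development")
    unless VALID_ENVS.include?(name)
      raise ArgumentError, "Unknown ETL_ENV: #{name.inspect}"
    end
    name
  end

  # Returns a required setting, failing loudly when it is missing.
  def self.fetch(key, source = ENV)
    source.fetch(key) { raise KeyError, "Missing required setting #{key}" }
  end
end
```

A job entry point could then call `EtlConfig.env` once at startup, so that a typo in a crontab line fails immediately rather than halfway through a run.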

Pure Ruby projects can be structured just like a Rails app, and there is usually less magic (more explicit), which is helpful.

It also appeared to me that Rails changes the way classes like Hash act.

I would actually recommend being explicit about depending on those extensions. Today I prefer to "cherry-pick" the exact extensions I need, at the top of each file (as described here).

One last word: you can test whole Kiba ETL pipelines just as well as your individual ETL components, and I would recommend doing so (I will cover that in a future blog post). It makes it easier to move things around, upgrade Ruby, and generally scale the team of developers (CI + tests).

I hope this provides enough guidance for you to make a decision; if not, please leave a comment!
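At the component level, this kind of test needs nothing special: a Kiba transform is any object responding to `process(row)`, so it can be exercised as plain Ruby. The `ParseAmount` class below is a hypothetical example, not something shipped with Kiba.

```ruby
# Hypothetical transform: converts one field of the row to a Float.
# Returning nil from `process` tells Kiba to drop the row.
class ParseAmount
  def initialize(field:)
    @field = field
  end

  def process(row)
    raw = row[@field]
    return nil if raw.nil? || raw.to_s.strip.empty?

    # Float() raises ArgumentError on malformed input instead of guessing.
    row.merge(@field => Float(raw))
  end
end

# Exercising the component directly, without running a pipeline.
transform = ParseAmount.new(field: :amount)
parsed  = transform.process({ id: 1, amount: "12.50" })
dropped = transform.process({ id: 2, amount: "" })
```

The same calls translate directly into RSpec or minitest assertions, and the class would typically be declared in a job definition with Kiba's `transform` keyword.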

Thibaut Barrère

From my point of view, using Rails for ETL projects adds unnecessary overhead. Take a look at dry-rb. Using https://dry-rb.org/gems/dry-system/ you can build a small application to process data. There is also a gem for building CLIs: https://dry-rb.org/gems/dry-cli/

Here is a list of all dry gems https://dry-rb.org/gems/

Timothy Alexis Vass
Yakov