
I have a use case where:

  • I always need to apply a pre-processing step to the data before I can use it, because the naming etc. does not follow the community conventions enforced by software further down the processing chain.

  • I cannot change the raw data, because it might be in a repo I don't control, or because it's too big to duplicate, ...

If I aim to provide users with the easiest and most transparent way of obtaining the data in pre-processed form, I can see two ways of doing this:

1. Load the unprocessed data with Intake and apply the pre-processing immediately:

import intake
from my_tools import pre_process

cat = intake.open_catalog('...')
raw_df = cat.some_data.read()  # raw, unprocessed data
df = pre_process(raw_df)  # apply the pre-processing by hand

2. Apply the pre-processing step as part of the .read() call, e.g. via a preprocess entry in the catalog:

Catalog:

sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process

And:

import intake

cat = intake.open_catalog('...')
df = cat.some_data.read()  # pre-processing already applied
willirath

1 Answer


Option 2 is not possible in Intake right now; Intake was designed to "load" rather than "process", so we have avoided the pipeline idea for now, but we might come back to it in the future.

However, you have a couple of options within Intake that you could consider alongside Option 1 above:

  • make your own driver, which implements the load and any processing exactly how you like. Writing drivers is pretty easy and can involve arbitrary code/complexity; see the first sketch after this list
  
  • write an alias-type driver, which takes the output of another entry in the same catalog and does something to it. See the docs and code for pointers, and the second sketch below.
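
For the first route, a minimal custom driver could look something like the sketch below. This is only a sketch: the class name PreprocessedCSVSource is made up, and it assumes that my_tools.pre_process from the question takes and returns a pandas DataFrame.

import pandas as pd
from intake.source.base import DataSource, Schema

from my_tools import pre_process  # the question's pre-processing function


class PreprocessedCSVSource(DataSource):
    """Load a CSV and apply pre_process before handing the data out."""

    container = 'dataframe'
    name = 'preprocessed_csv'
    version = '0.0.1'
    partition_access = False

    def __init__(self, urlpath, metadata=None):
        self._urlpath = urlpath
        self._df = None
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # Load once, so the schema describes the pre-processed data.
        df = self.read()
        return Schema(
            datashape=None,
            dtype={name: str(dt) for name, dt in df.dtypes.items()},
            shape=df.shape,
            npartitions=1,
            extra_metadata={},
        )

    def read(self):
        if self._df is None:
            self._df = pre_process(pd.read_csv(self._urlpath))
        return self._df

    def _close(self):
        self._df = None

If the class is importable (here assumed to live in a package called my_package), the catalog can point at it with a fully qualified driver name instead of a registered shorthand:

sources:
  some_data:
    description: "Some data (pre-processed on read)"
    driver: my_package.PreprocessedCSVSource
    args:
      urlpath: "/path/to/some_raw_data.csv"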
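
For the alias route, here is a sketch modelled loosely on Intake's AliasSource: the source looks up a sibling entry of the same catalog by name and post-processes its output. The class name PreprocessSource and its target argument are again invented for illustration, and a full driver would also implement _get_schema.

from intake.source.base import DataSource

from my_tools import pre_process


class PreprocessSource(DataSource):
    """Read another entry of the same catalog, then apply pre_process."""

    container = 'dataframe'
    name = 'preprocess_alias'
    version = '0.0.1'
    partition_access = False

    def __init__(self, target, metadata=None):
        self.target = target  # name of the raw entry in the same catalog
        self._df = None
        super().__init__(metadata=metadata)

    def read(self):
        # Intake sets self.cat when the source is created from a catalog;
        # that is what lets this source reach its sibling entry.
        if self.cat is None:
            raise ValueError("This source only works from within a catalog")
        if self._df is None:
            self._df = pre_process(self.cat[self.target].read())
        return self._df

    def _close(self):
        self._df = None

The catalog would then carry both the raw entry and the derived one:

sources:
  raw_data:
    driver: csv
    args:
      urlpath: "/path/to/some_raw_data.csv"
  some_data:
    description: "Some data (pre-processed on read)"
    driver: my_package.PreprocessSource
    args:
      target: raw_data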
mdurant