
I have a use case where:

  • I always need to apply a pre-processing step to the data before I can use it, because the naming etc. does not follow the community conventions enforced by software further down the processing chain.

  • I cannot change the raw data, because it might be in a repo I don't control, or because it's too big to duplicate, ...

If I aim to provide users with the easiest and most transparent way of obtaining the data in pre-processed form, I can see two ways of doing this:

1. Load the unprocessed data with Intake and apply the pre-processing immediately:

import intake
from my_tools import pre_process

cat = intake.open_catalog('...')
raw_df = cat.some_data.read()  # raw, unprocessed data
df = pre_process(raw_df)  # apply the pre-processing by hand

2. Apply the pre-processing step as part of the .read() call, e.g. via a preprocess entry in the catalog:

Catalog:

sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process

And:

import intake

cat = intake.open_catalog('...')
df = cat.some_data.read()  # pre-processing already applied
willirath

1 Answer


Option 2 is not possible in Intake right now; Intake was designed to "load" rather than "process", so we have avoided the pipeline idea for now, but we might come back to it in the future.

However, you have a couple of options within Intake that you could consider alongside Option 1 above:

  • make your own driver, which implements the load and any processing exactly how you like. Writing drivers is pretty easy and can involve arbitrary code/complexity; see the first sketch after this list
  
  • write an alias-type driver, which takes the output of another entry in the same catalog and does something to it. See the docs and code for pointers, and the second sketch below.
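
For the first route, a minimal custom driver could look something like the sketch below. This is only a sketch: the class name PreprocessedCSVSource is made up, and it assumes that my_tools.pre_process from the question takes and returns a pandas DataFrame.

import pandas as pd
from intake.source.base import DataSource, Schema

from my_tools import pre_process  # the question's pre-processing function


class PreprocessedCSVSource(DataSource):
    """Load a CSV and apply pre_process before handing the data out."""

    container = 'dataframe'
    name = 'preprocessed_csv'
    version = '0.0.1'
    partition_access = False

    def __init__(self, urlpath, metadata=None):
        self._urlpath = urlpath
        self._df = None
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # Load once, so the schema describes the pre-processed data.
        df = self.read()
        return Schema(
            datashape=None,
            dtype={name: str(dt) for name, dt in df.dtypes.items()},
            shape=df.shape,
            npartitions=1,
            extra_metadata={},
        )

    def read(self):
        if self._df is None:
            self._df = pre_process(pd.read_csv(self._urlpath))
        return self._df

    def _close(self):
        self._df = None

If the class is importable (here assumed to live in a package called my_package), the catalog can point at it with a fully qualified driver name instead of a registered shorthand:

sources:
  some_data:
    description: "Some data (pre-processed on read)"
    driver: my_package.PreprocessedCSVSource
    args:
      urlpath: "/path/to/some_raw_data.csv"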
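
For the alias route, here is a sketch modelled loosely on Intake's AliasSource: the source looks up a sibling entry of the same catalog by name and post-processes its output. The class name PreprocessSource and its target argument are again invented for illustration, and a full driver would also implement _get_schema.

from intake.source.base import DataSource

from my_tools import pre_process


class PreprocessSource(DataSource):
    """Read another entry of the same catalog, then apply pre_process."""

    container = 'dataframe'
    name = 'preprocess_alias'
    version = '0.0.1'
    partition_access = False

    def __init__(self, target, metadata=None):
        self.target = target  # name of the raw entry in the same catalog
        self._df = None
        super().__init__(metadata=metadata)

    def read(self):
        # Intake sets self.cat when the source is created from a catalog;
        # that is what lets this source reach its sibling entry.
        if self.cat is None:
            raise ValueError("This source only works from within a catalog")
        if self._df is None:
            self._df = pre_process(self.cat[self.target].read())
        return self._df

    def _close(self):
        self._df = None

The catalog would then carry both the raw entry and the derived one:

sources:
  raw_data:
    driver: csv
    args:
      urlpath: "/path/to/some_raw_data.csv"
  some_data:
    description: "Some data (pre-processed on read)"
    driver: my_package.PreprocessSource
    args:
      target: raw_data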
mdurant