I have a use case where:
- I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. doesn't follow the community conventions enforced by some software further down the processing chain.)
- I cannot change the raw data. (Because it might be in a repo I don't control, or because it's too big to duplicate, ...)
If I want to provide a user with the easiest and most transparent way of obtaining the data in pre-processed form, I can see two ways of doing this:
1. Load unprocessed data with intake and apply the pre-processing immediately (a small convenience wrapper for this is sketched further below):
import intake
from my_tools import pre_process

cat = intake.open_catalog('...')
raw_df = cat.some_data.read()  # raw data, naming does not follow the conventions yet
df = pre_process(raw_df)       # fix naming etc. by hand
2. Apply the pre-processing step as part of the .read() call (a sketch of how this might be wired up also follows further below).
Catalog:
sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process
And:
import intake

cat = intake.open_catalog('...')
df = cat.some_data.read()  # pre-processing happens inside .read()
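
For option 1, the main drawback is that every user has to know about and remember the pre_process call. One way to keep it transparent would be to ship a small convenience function alongside pre_process, so the user only ever calls one thing. This is just a sketch; the name load_some_data and the idea of keeping it in my_tools are made up for illustration:

import intake

from my_tools import pre_process


def load_some_data(catalog_path):
    """Open the catalog, read the raw entry and return it pre-processed."""
    cat = intake.open_catalog(catalog_path)
    return pre_process(cat.some_data.read())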
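
For option 2, if intake does not support a preprocess key like the one above out of the box, a similar effect might be achievable with a thin driver subclass. The sketch below rests on two assumptions: that the CSV driver class lives at intake.source.csv.CSVSource, and that a catalog's driver: field accepts a fully qualified class path.

from intake.source.csv import CSVSource

from my_tools import pre_process


class PreprocessedCSVSource(CSVSource):
    """Reads a CSV like the plain csv driver, then applies pre_process."""

    name = "preprocessed_csv"

    def read(self):
        # load the raw data as usual, then fix the naming etc. before returning it
        return pre_process(super().read())

The catalog entry would then point at this class (e.g. driver: my_tools.PreprocessedCSVSource, or wherever the class ends up living) instead of driver: csv, and the user-facing code from option 2 stays exactly the same.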