I am reading about "parameters" here and wondering whether I can define catalogue-level parameters that I can later use in the definitions of the catalogue's sources.

Consider a simple YAML-catalogue with two sources:

sources:
  data1:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    
  data2:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

Note that both data sources (data1 and data2) use the snapshot_date parameter inside the urlpath argument. With this definition, I can load the data sources with:

cat = intake.open_catalog("./catalog.yaml")
cat.data1(snapshot_date="latest").read()   # reads from data/latest/data1.csv
cat.data2(snapshot_date="20211029").read() # reads from data/20211029/data2.csv

Please note that cat.data1().read() will not work: snapshot_date defaults to an empty string, so the CSV driver cannot find the path "./data//data1.csv".

I can set a default value by adding a parameters section to every (!) source, as shown below.

sources:
  data1:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    
  data2:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

But this looks complicated (too much repetitive code) and is a little inconvenient for the end user: a user who wants to load all data sources for a given date has to explicitly provide the snapshot_date parameter to every (!) data source at initialization. IMO, it would be nice if the user could provide this value once, when initializing the catalog.

Is there a way to define the snapshot_date parameter at the catalog level, so that:

  • I can set a default value (e.g. "latest" in my example) in the YAML definition of the catalogue's parameter,
  • or I can pass the parameter value at runtime in the call intake.open_catalog("./catalog.yaml", snapshot_date="20211029"),
  • and this value is accessible in the definitions of the catalog's data sources?

cat = intake.open_catalog("./catalog.yaml", snapshot_date="20211029")
cat.data1.read()  # will return data from ./data/20211029/data1.csv
cat.data2.read()  # will return data from ./data/20211029/data2.csv
cat.data2(snapshot_date="latest").read()  # will return data from ./data/latest/data2.csv

cat = intake.open_catalog("./catalog.yaml")
cat.data1.read()  # will return data from ./data/latest/data1.csv
cat.data2.read()  # will return data from ./data/latest/data2.csv

Thanks in advance

Mikhail Shevelev

2 Answers

This idea has been suggested before (https://github.com/intake/intake/pull/562, https://github.com/intake/intake/issues/511), and I have an inkling that https://github.com/zillow/intake-nested-yaml-catalog may support something like what you are asking for.

However, I fully support adding this functionality in Intake, either based on #562, above, or otherwise. Adding it to the base Catalog and YAML file(s) catalog should be easy, but doing it so that it works for all subclasses might be tricky.

Currently, you can achieve what you want using environment variables, e.g., by replacing "{{snapshot_date}}" with "{{env(SNAPSHOT_DATE)}}", but you would need to tell the user that this variable must be set. In addition, if the value is not used within a string, you would still need a parameter definition to cast it to the right type.
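
A minimal sketch of the environment-variable workaround, assuming the urlpath entries in the catalogue have been changed to use "{{env(SNAPSHOT_DATE)}}". The helper names set_snapshot_date and open_snapshot_catalog are hypothetical and not part of Intake; the only Intake call used is intake.open_catalog, as in the question.

```python
import os


def set_snapshot_date(snapshot_date="latest", var="SNAPSHOT_DATE"):
    """Export the snapshot date so that {{env(SNAPSHOT_DATE)}} can pick it up."""
    os.environ[var] = str(snapshot_date)
    return os.environ[var]


def open_snapshot_catalog(path, snapshot_date="latest"):
    """Hypothetical wrapper: set the variable first, then open the catalogue."""
    import intake  # third-party; imported lazily so set_snapshot_date works without it

    set_snapshot_date(snapshot_date)
    return intake.open_catalog(path)
```

With this, open_snapshot_catalog("./catalog.yaml", "20211029").data1.read() would be expected to read from data/20211029/data1.csv. The variable is process-wide, so this approach is awkward for code that needs two snapshot dates at the same time.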

mdurant
  • Thanks @mdurant! I like the motivation with DB connection settings in #562. This is something I faced too when I started using [intake-sql](https://github.com/intake/intake-sql). I do not want to put the sqlalchemy connection string into the catalog YAML file, so I have to pass it within Python code every time I initialize a data source. It would be more convenient to pass it to the catalog once as a parameter, so that the sources can just use it. Would love to see this functionality added to intake. – Mikhail Shevelev Nov 04 '21 at 12:49
  • Consider it on the todo list – mdurant Nov 04 '21 at 13:09

This is a bit of a hack, but consider a yaml file with this content:

global_params:
  snapshot_date: &global
    default: latest
    description: ''
    type: str

sources:
  data1:
    args:
      urlpath: '{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv'
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    parameters:
      snapshot_date: *global
  data2:
    args:
      urlpath: '{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv'
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    parameters:
      snapshot_date: *global

Now intake will accept a snapshot_date keyword argument for specific sources, while the default is defined only once.
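
To also avoid repeating the keyword argument for every source, a small helper can apply the same parameters to each entry. This is only a sketch: configure_all is a hypothetical name, and it assumes the catalogue can be iterated to get source names, can be indexed by name, and that each entry accepts parameters when called (as cat.data1(snapshot_date=...) does in the question).

```python
def configure_all(cat, **params):
    """Return {source name: source configured with params} for every entry in cat."""
    return {name: cat[name](**params) for name in list(cat)}


# Usage with Intake (not executed here):
#   cat = intake.open_catalog("./catalog.yaml")
#   sources = configure_all(cat, snapshot_date="20211029")
#   df1 = sources["data1"].read()
```

Sources that do not declare snapshot_date may reject or ignore the argument, so this fits catalogues where every source uses the shared parameter.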

Some relevant answers: 1 and 2.

SultanOrazbayev