
I'm quite new to Kedro. After installing Kedro in my conda environment, I get the following error when trying to list my catalog:

Command performed: kedro catalog list

Error:

kedro.io.core.DataSetError: An exception occurred when parsing config for DataSet df_medinfo_raw: Object ParquetDataSet cannot be loaded from kedro.extras.datasets.pandas. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pandas.ParquetDataSet:

I installed Kedro through conda-forge: conda install -c conda-forge "kedro[pandas]". As far as I understand, installing Kedro this way also installs the pandas dependencies.

I tried to read the kedro documentation for dependencies, but it's not really clear how to solve this kind of issue.

My kedro version is 0.17.6.

eglease
Rubens Rodrigues

2 Answers


Kedro uses Pandas to load ParquetDataSet objects, and Pandas requires additional dependencies to accomplish this (see "Installation: Other data sources"). That is, in addition to Pandas, one must also install either fastparquet or pyarrow.
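Before reinstalling anything, it can help to confirm whether either engine is actually present in the active environment. The following is a minimal sketch, not part of Kedro or pandas: the helper name `available_parquet_engines` is made up for illustration, and it only checks that the packages are importable, not that their versions are compatible with your pandas.

```python
from importlib.util import find_spec

# Packages pandas can use as a Parquet engine.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate packages that are importable in this environment.

    Hypothetical helper for illustration; it does not import the packages,
    it only asks the import system whether they can be found.
    """
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_parquet_engines()
    if found:
        print("Parquet engine(s) found:", ", ".join(found))
    else:
        print("No Parquet engine found; install pyarrow or fastparquet.")
```

Run it with the same interpreter that runs `kedro` (for example, inside the activated conda environment); an empty result would be consistent with the error in the question.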

For Conda, you want either:

## use pyarrow for parquet
conda install -c conda-forge kedro pandas pyarrow

or

## or use fastparquet for parquet
conda install -c conda-forge kedro pandas fastparquet

Note that the syntax used in the question kedro[pandas] is meaningless to Conda (i.e., it ultimately parses to just kedro). Conda package specification uses a custom grammar called MatchSpec, where anything inside a [...] is parsed for a [key1=value1;key2=value2;...] syntax. Essentially, the [pandas] is treated as an unknown key, which is ignored.

merv
  • Thanks Merv! After installing pandas properly into Kedro (kedro[name='pandas']) as well as installing pyarrow, I could list the catalogs. Thanks – Rubens Rodrigues Jan 17 '22 at 12:36

Try installing with pip:

pip install "kedro[pandas]"

As of now, conda doesn't support optional dependencies. A feature request for this has been submitted here: https://github.com/conda/conda/issues/7502

Also, the Kedro docs mention that pip is recommended: https://kedro.readthedocs.io/en/stable/02_get_started/02_install.html

It is also possible to install Kedro using conda, as follows, but we recommend using pip at this point to eliminate any potential dependency issues, as follows:

Also, as @datajoely mentioned, you can be more specific about which dataset modules you need:

pip install "kedro[pandas.ParquetDataSet]"

You can read more about Kedro dependencies here: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html?highlight=top-level#workflow-dependencies
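To see which extras a pip-installed package actually declares, the standard library's `importlib.metadata` can read them from the installed distribution's metadata. This is a rough sketch rather than anything Kedro provides: `declared_extras` is a hypothetical helper name, and the list it prints depends entirely on the Kedro version you have installed.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name):
    """Return the extras a distribution declares, or [] if it is not installed.

    Hypothetical helper for illustration; it reads the 'Provides-Extra'
    fields from the installed package metadata.
    """
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # get_all returns None when the field is absent.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    for extra in declared_extras("kedro"):
        print(extra)
```

An extra listed here is a name you can put inside the brackets of a `pip install "kedro[...]"` command.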

Rahul Kumar
  • You can also be specific with `pip install kedro[pandas.ParquetDataSet]` - I have an example here https://github.com/datajoely/modular-spaceflights/blob/260b209c24c7440342b41fb02f218d39c9115220/src/requirements.in#L9 – datajoely Jan 15 '22 at 11:48
  • Thanks Rahul. I went through Conda, because we are not using pip here in the project. However, I've seen an example with pip that works in the way you proposed. – Rubens Rodrigues Jan 17 '22 at 12:38