I am using Kedro and want to load some data and work with it. To do that, the data has to be registered in the conf/base/catalog.yml file. The Kedro documentation on the Data Catalog explains how to register data for Kedro to load, but it has little to no information on how to load a .arrow file.

In conf/base/catalog.yml I tried to register my data like this:

dataframe:
  type: arrow.ArrowDataSet
  filepath: "home/place/data.arrow"
  layer: primary

Of course, I also tried different combinations from the Data Catalog documentation mentioned above.
The error I get is the following:
DataSetError: An exception occurred when parsing config for DataSet 'dataframe': Class 'arrow.ArrowDataSet' not found or one of its dependencies has not been installed.

I have, of course, installed the arrow package in my environment.

Does the Kedro Data Catalog simply not accept .arrow files or is there a way to register such a format in the catalog.yml file?

Thanks in advance,

Jamal

  • It doesn't look like Kedro supports `arrow`. But anyway, if you are going to store data as files, you should use `parquet`, not `arrow`. See examples here: https://kedro.readthedocs.io/en/stable/data/data_catalog.html?highlight=parquet#example-10-loads-saves-a-parquet-file-on-local-file-system-storage-using-specified-load-and-save-arguments – 0x26res Jan 17 '23 at 14:00
  • Hey @0x26res, that is the solution I finally went with, but only because Kedro supports .parquet. It sounds like you have more reasons to prefer parquet for saving data as files. Am I right? If so, could you spare some time to explain them? Much appreciated, thank you! – Jamal Rnjbal Jan 19 '23 at 10:39
  • https://stackoverflow.com/questions/56472727/difference-between-apache-parquet-and-arrow – 0x26res Jan 19 '23 at 11:08

1 Answer

As @0x26res said, you can use the Parquet dataset or any other dataset type that Kedro supports. Kedro can read Parquet through pyarrow because, under the hood, the dataset calls pandas read_parquet, which supports two engines (pyarrow and fastparquet) and tries pyarrow first by default.
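As a rough sketch, a catalog entry for the Parquet route could look something like the following. It assumes the data has already been converted from .arrow to .parquet; the filepath is a hypothetical placeholder modelled on the one in the question, and the explicit engine setting is optional and shown only for illustration.

```yaml
# conf/base/catalog.yml
# Sketch only: the filepath below is a hypothetical placeholder.
dataframe:
  type: pandas.ParquetDataSet
  filepath: "home/place/data.parquet"
  layer: primary
  load_args:
    # Passed through to pandas read_parquet; pyarrow is already tried first by default.
    engine: pyarrow
```

The dataset name (dataframe) stays the same as in the question, so nodes that reference it should not need to change.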

It may be necessary to install dependencies to use other dataset types:

pip install "kedro[pandas.ParquetDataSet]"
matt91t