
I have 4 CSV files in Azure Blob Storage, with the same metadata, that I want to process. How can I add them to the Data Catalog under a single name in Kedro?
I checked this question:
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But that approach seems to load all the files in the given folder, whereas my requirement is to read only the given 4 out of the many files in the Azure container.

Example: I have many files in an Azure container, among which are 4 transaction CSV files named sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro Data Catalog under one dataset.

2 Answers


For starters, PartitionedDataSet is lazy, meaning that files are not actually read until you call the load function for a given partition. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can choose which partitions you actually load and work with, as in the sketch below.
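As an illustration (not from the original answer), here is a minimal sketch of a node that loads only the partitions it cares about. It assumes the catalog wires a PartitionedDataSet into the node as partitioned_input, and the partition ids are hypothetical:

import pandas as pd

def concat_selected_sales(partitioned_input):
    """Combine only the four wanted partitions of a PartitionedDataSet.

    Kedro passes a node that consumes a PartitionedDataSet a dict of
    {partition_id: load_function}; nothing is read from storage until
    a load function is actually called, so the other files are never
    downloaded.
    """
    # Hypothetical partition ids; substitute the real date ranges.
    # (Ids are paths relative to the dataset's `path`, with any
    # `filename_suffix` stripped.)
    wanted = {
        "sales_2021-01-01_2021-01-31",
        "sales_2021-02-01_2021-02-28",
        "sales_2021-03-01_2021-03-31",
        "sales_2021-04-01_2021-04-30",
    }
    frames = [load() for key, load in partitioned_input.items() if key in wanted]
    return pd.concat(frames, ignore_index=True)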

Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to just select them. For example, if you have:

file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv

you can specify filename_suffix: _file_i_care_about.csv.
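For context, a catalog entry using that option could look like the sketch below; the dataset name, Azure path, and credentials key are assumptions, not from the original answer:

my_selected_files:
  type: PartitionedDataSet
  path: abfs://my-container/01_raw/  # hypothetical Azure Blob path
  dataset: pandas.CSVDataSet
  filename_suffix: "_file_i_care_about.csv"
  credentials: azure_blob  # assumed entry in credentials.yml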

deepyaman

I don't think there's a direct way to do this. You can add another subdirectory inside the blob storage containing just the 4 files and then use:

my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"

Or, in case the requirement of using only 4 files is not going to change anytime soon, you might as well pass the 4 files as separate entries in catalog.yml to avoid over-engineering it, as in the sketch below.
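For completeness, a sketch of that alternative; the entry names, file paths, and credentials key are hypothetical, not from the original answer:

sales_jan:
  type: pandas.CSVDataSet
  filepath: abfs://my-container/sales_2021-01-01_2021-01-31.csv
  credentials: azure_blob

sales_feb:
  type: pandas.CSVDataSet
  filepath: abfs://my-container/sales_2021-02-01_2021-02-28.csv
  credentials: azure_blob

# ...and two more entries of the same shape for the remaining files.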