I have a large dataset of daily files located at `/some/data/{YYYYMMDD}.parquet` (or it could also be something like `/some/data/{YYYY}/{MM}/{YYYYMMDD}.parquet`).
I describe the data source in a `mycat.yaml` file as follows:
```yaml
sources:
  source_partitioned:
    args:
      engine: pyarrow
      urlpath: "/some/data/*.parquet"
    description: ''
    driver: intake_parquet.source.ParquetSource
```
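(For the nested layout mentioned above, I assume the `urlpath` would simply become a deeper glob, e.g. `urlpath: "/some/data/*/*/*.parquet"`, though I haven't tested that.)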
I want to be able to read a subset of the files (partitions) into memory.
If I run `source = intake.open_catalog('mycat.yaml').source_partitioned; print(source.npartitions)`, I get `0`, probably because the partition information is not yet initialized. After `source.discover()`, `source.npartitions` is updated to `1726`, which is exactly the number of individual files on disk.
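For reference, a minimal sketch of the steps above (assuming `mycat.yaml` is the catalog from this post):

```python
import intake

source = intake.open_catalog("mycat.yaml").source_partitioned
print(source.npartitions)  # -> 0: partition metadata not loaded yet

source.discover()          # scans the files and fills in the metadata
print(source.npartitions)  # -> 1726: one partition per file on disk
```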
How would I load data:

- only for a given day (e.g. 20180101)?
- for a period between two days (e.g. between 20170601 and 20190223)?
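The closest workaround I can think of bypasses intake entirely: build the file list myself and hand it to dask. A sketch, assuming the flat `/some/data/{YYYYMMDD}.parquet` layout from above (`dd.read_parquet` accepts both a single path and a list of paths):

```python
from pathlib import Path
import dask.dataframe as dd

# A single day is just a direct read of that one file.
day_df = dd.read_parquet("/some/data/20180101.parquet", engine="pyarrow")

# For a range: YYYYMMDD file names sort lexicographically, so the
# date filter can be done on the file names as plain strings.
files = sorted(Path("/some/data").glob("*.parquet"))
selected = [str(p) for p in files if "20170601" <= p.stem <= "20190223"]
range_df = dd.read_parquet(selected, engine="pyarrow")
```

But I would prefer to do this through the catalog, so that the `urlpath` stays in one place.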
If this is described somewhere on the wiki, feel free to point me to the appropriate section.
Note: after thinking a little more, I realized this might be related to dask functionality, and my goal can probably be achieved by converting the source to a dask dataframe with the `.to_dask()` method. Therefore I am putting the `dask` label on this question.
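For completeness, here is the `.to_dask()` route I was considering; the partition-to-file mapping is my assumption and I have not verified it:

```python
import intake

source = intake.open_catalog("mycat.yaml").source_partitioned
df = source.to_dask()  # lazy dask dataframe over all files

# Assumption (unverified): one partition per file, in sorted order.
# If that holds, a contiguous date range maps to a partition slice:
subset = df.partitions[100:200]  # hypothetical indices for the range
result = subset.compute()        # materialize only those files
```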