The Python data access and cataloguing project "Intake", https://intake.readthedocs.io/en/latest/
Questions tagged [intake]
25 questions
9 votes, 1 answer
Why is performance so much better with zarr than parquet when using dask?
When I run essentially the same calculations with dask against zarr data and parquet data, the zarr-based calculations are significantly faster. Why? Is it maybe because I did something wrong when I created the parquet files?
I've replicated the…

Christine
- 135
- 1
- 7
4 votes, 1 answer
Column Name Shift using read_csv in Dask
I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation.
The issue I'm seeing is that the csv files I'm loading don't have an index column so Dask is…

Brenton
- 85
- 1
- 8
2 votes, 2 answers
partitioning intake data sources
I have a large dataset of daily files located at /some/data/{YYYYMMDD}.parquet (or something like /some/data/{YYYY}/{MM}/{YYYYMMDD}.parquet).
I describe the data source in a mycat.yaml file as follows:
sources:
  source_paritioned:
    args:
      …

Mikhail Shevelev
- 408
- 5
- 12
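One way to cover all the daily files with a single catalog entry is to point a glob at the directory; a minimal sketch, assuming the built-in parquet driver and the flat layout above (the source name, description, and path are illustrative):

```yaml
sources:
  source_partitioned:
    driver: parquet
    description: all daily files, one partition per file
    args:
      # the glob matches every daily file; dask treats each matched
      # file as one partition of the resulting dataframe
      urlpath: "/some/data/*.parquet"
```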
1 vote, 1 answer
Listing available drivers in intake
How can I list all the available drivers in intake?
I attempted to run dir on intake.source, but didn't manage to find a listing of drivers.

SultanOrazbayev
- 14,900
- 3
- 16
- 46
1 vote, 1 answer
can I define data filters with intake catalogs?
I would like to use intake to not only link to published datasets, but also filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but this means providing the user code beyond the metadata in order to give some…

Thomas
- 23
- 5
1 vote, 2 answers
Intake: catalogue level parameters
I am reading about "parameters" here and wondering whether I can define catalogue level parameters that I can later use in the definition of the catalogue's sources?
Consider a simple YAML-catalogue with two sources:
sources:
  data1:
    args:
      …

Mikhail Shevelev
- 408
- 5
- 12
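Intake's documented mechanism for this is per-source user parameters, which are substituted into the args via templating; a sketch under that assumption (the parameter name, type, default, and path are illustrative):

```yaml
sources:
  data1:
    driver: csv
    parameters:
      year:
        description: year of the data files to load
        type: int
        default: 2020
    args:
      # the parameter value is templated into the path at load time
      urlpath: "/some/data/{{ year }}/*.csv"
```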
1 vote, 1 answer
Apache Arrow or feather plugin?
I'd like to use local feather-files as sources in Intake. Does a plugin for feather/arrow not exist yet, or am I missing something?

bowlby
- 649
- 1
- 8
- 18
1 vote, 1 answer
How to open a json file with Intake?
I'm trying to use intake to create a data catalogue for a JSON file. #197 mentions "Essentially, you need to provide the reader function json.loads, if each of your files is a single JSON block which evaluates to a list of objects."
I created a…

rdmolony
- 601
- 1
- 7
- 15
1 vote, 1 answer
How to aggregate large number of small csv files (~50k files each 120kb) efficiently (code size, scheduler+cluster runtime) with dask?
I have a data set that contains one timeseries per file. I am really happy with how dask handles ~1k files on our cluster (one directory in my case). But I have around 50 directories.
The funny thing is that building the dask graph seems…

till
- 570
- 1
- 6
- 22
1 vote, 1 answer
How can I add a custom method to a plugin so it returns the data source not only in dask format but also in several different custom formats?
I am working on an intake plugin that allows reading specific JSON files from GitHub. These JSON files contain basic information about systems that we want to simulate with different simulation software, each with its own input format. We have…

TMS
- 21
- 1
1 vote, 1 answer
dataset view and access control in yaml file
I am new to intake and I am trying to understand how I can control the visibility and access rights for catalog entries. For example, I would like to find out what a catalog yaml file looks like for the following case: suppose I have two csv files to…

TMS
- 21
- 1
1 vote, 1 answer
TLS-enabled communication between intake client and intake server
The official Intake documentation mentions:
Authorization plugins are classes that can be used to customize access
permissions to the Intake catalog server. The Intake server and client
communicate over HTTP, so when security is a…

alex
- 61
- 4
1 vote, 1 answer
How to do always-necessary pre-processing / cleaning with intake?
I have a use case where:
I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. don't follow community conventions enforced by some software further down the processing chain.)
I cannot…

willirath
- 63
- 4
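One option here may be a derived source, where a second catalog entry applies a named cleaning function to the raw entry so users only ever open the cleaned view; a sketch assuming intake's derived-source machinery (intake >= 0.6) is available, with hypothetical paths and a hypothetical dotted path to the user's own cleaning function:

```yaml
sources:
  raw:
    driver: csv
    args:
      urlpath: "/some/data/raw/*.csv"
  cleaned:
    driver: intake.source.derived.DataFrameTransform
    args:
      # the derived source reads from the "raw" entry above
      targets: [raw]
      # hypothetical function that renames columns to community conventions
      transform: "mypackage.cleaning.rename_to_convention"
```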
1 vote, 1 answer
data persistence to original data source
Can anybody tell me whether the use case below makes sense and is applicable to the Intake software component?
We would like to use intake to build an abstraction layer or API service endpoint to encapsulate typical data operations, such as data retrieval and data…

alex
- 61
- 4
1 vote, 1 answer
Data source on GCP BigQuery
I tried to look for any existing Intake components, such as a Driver or Plugin, that can support GCP BigQuery. If no such support exists, please advise on how to implement a subclass of intake.source.base.DataSource.

alex
- 61
- 4