Questions tagged [intake]

The Python data access and cataloguing project "Intake": https://intake.readthedocs.io/en/latest/

25 questions
9 votes · 1 answer

Why is performance so much better with zarr than parquet when using dask?

When I run essentially the same calculations with dask against zarr data and parquet data, the zarr-based calculations are significantly faster. Why? Is it maybe because I did something wrong when I created the parquet files? I've replicated the…
Christine · 135
4 votes · 1 answer

Column Name Shift using read_csv in Dask

I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation. The issue I'm seeing is that the csv files I'm loading don't have an index column so Dask is…
Brenton · 85
2 votes · 2 answers

partitioning intake data sources

I have a large dataset of daily files located at /some/data/{YYYYMMDD}.parquet (or it can also be something like /some/data/{YYYY}/{MM}/{YYYYMMDD}.parquet). I describe the data source in a mycat.yaml file as follows: sources: source_paritioned: args: …
Mikhail Shevelev · 408
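For reference, a minimal sketch of what such a templated catalog entry can look like in Intake's YAML syntax (the driver name, parameter name, and default value here are illustrative assumptions, not taken from the question):

```yaml
# mycat.yaml (hypothetical): one parquet file per day, with the date
# exposed as a user parameter and templated into the path.
sources:
  source_partitioned:
    driver: parquet
    parameters:
      date:
        description: day to load, formatted as YYYYMMDD
        type: str
        default: "20240101"
    args:
      urlpath: "/some/data/{{ date }}.parquet"
```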
1 vote · 1 answer

Listing available drivers in intake

How can I list all the available drivers in intake? I attempted to run dir on intake.source, but didn't manage to find a listing of drivers.
SultanOrazbayev · 14,900
1 vote · 1 answer

can I define data filters with intake catalogs?

I would like to use intake to not only link to published datasets, but also filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but this means providing the user code beyond the metadata in order to give some…
Thomas · 23
1 vote · 2 answers

Intake: catalogue level parameters

I am reading about "parameters" here and wondering whether I can define catalogue-level parameters that I can later use in the definition of the catalogue's sources. Consider a simple YAML catalogue with two sources: sources: data1: args: …
Mikhail Shevelev · 408
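For context, a hedged sketch of the documented source-level parameters that the question hopes to hoist to catalogue level (paths and names are invented for illustration):

```yaml
# Hypothetical catalog: "version" is defined per source and templated
# into the urlpath; the question asks whether one such definition can
# be shared by several sources at the catalogue level instead.
sources:
  data1:
    driver: csv
    parameters:
      version:
        description: dataset version to load
        type: int
        default: 1
    args:
      urlpath: "/data/v{{ version }}/data1.csv"
```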
1 vote · 1 answer

Apache Arrow or feather plugin?

I'd like to use local feather files as sources in Intake. Does a plugin for feather/arrow not exist yet, or am I missing something?
bowlby · 649
1 vote · 1 answer

How to open a json file with Intake?

I'm trying to use intake to create a data catalogue for a JSON file. #197 mentions "Essentially, you need to provide the reader function json.loads, if each of your files is a single JSON block which evaluates to a list of objects." I created a…
rdmolony · 601
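The advice quoted from issue #197 amounts to using `json.loads` as the reader for files whose entire contents are one JSON block that evaluates to a list of objects; in plain Python (standard library only, sample data invented for illustration) the pattern is:

```python
import json

# A file whose whole contents are a single JSON block that evaluates
# to a list of objects, the case described in issue #197.
text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

records = json.loads(text)  # one call parses the whole block
print(records[0]["name"])
```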
1 vote · 1 answer

How to aggregate a large number of small csv files (~50k files, each 120kb) efficiently (code size, scheduler+cluster runtime) with dask?

I have a data set that contains one timeseries per file. I am really happy with how dask handles ~1k files on our cluster (one directory in my case). But I have around 50 directories. The funny thing that happens is that building the dask graph seems…
till · 570
1 vote · 1 answer

How can I add a customized method to a plugin that returns the data source not only in dask format, but also in several different custom formats?

I am working on an intake plugin that allows reading specific JSON files from GitHub. These JSON files contain basic information about systems that we want to simulate with different simulation software, each with its own input format. We have…
TMS · 21
1 vote · 1 answer

dataset view and access control in yaml file

I am new to intake and I am trying to understand how I can control the visibility and access rights for catalog entries. For example, I would like to find out what a catalog yaml file looks like for the following case: suppose I have two csv files to…
TMS · 21
1 vote · 1 answer

TLS-enabled communication between intake client and intake server

The official Intake documentation mentions that authorization plugins are classes that can be used to customize access permissions to the Intake catalog server. The Intake server and client communicate over HTTP, so when security is a…
alex · 61
1 vote · 1 answer

How to do always necessary pre processing / cleaning with intake?

I have a use case where: I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. don't follow community conventions enforced by some software further down the processing chain.) I cannot…
willirath · 63
1 vote · 1 answer

data persistence to original data source

Can anybody tell me whether the use case below makes sense and is applicable to the Intake software component? We would like to use intake to build an abstraction layer or API service endpoint that encapsulates typical data operations, such as data retrieval and data…
alex · 61
1 vote · 1 answer

Data source on GCP BigQuery

I tried to look for any existing intake components, such as a Driver or Plugin, that can support GCP BigQuery. If no such support exists, please advise on how to implement a subclass of intake.source.base.DataSource.
alex · 61