Questions tagged [intake]

The Python data access and cataloguing project "Intake": https://intake.readthedocs.io/en/latest/

25 questions
9 votes · 1 answer

Why is performance so much better with zarr than parquet when using dask?

When I run essentially the same calculations with dask against zarr data and parquet data, the zarr-based calculations are significantly faster. Why? Is it maybe because I did something wrong when I created the parquet files? I've replicated the…
Christine · 135
4 votes · 1 answer

Column Name Shift using read_csv in Dask

I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation. The issue I'm seeing is that the csv files I'm loading don't have an index column so Dask is…
Brenton · 85
2 votes · 2 answers

partitioning intake data sources

I have a large dataset of daily files located at /some/data/{YYYYMMDD}.parquet (or it can also be something like /some/data/{YYYY}/{MM}/{YYYYMMDD}.parquet). I describe the data source in a mycat.yaml file as follows: sources: source_paritioned: args: …
Mikhail Shevelev · 408
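For reference, a minimal sketch of what such a templated catalog entry can look like in Intake's YAML syntax (the driver name, parameter name, and default value here are illustrative assumptions, not taken from the question):

```yaml
# mycat.yaml (hypothetical): one parquet file per day, with the date
# exposed as a user parameter and templated into the path.
sources:
  source_partitioned:
    driver: parquet
    parameters:
      date:
        description: day to load, formatted as YYYYMMDD
        type: str
        default: "20240101"
    args:
      urlpath: "/some/data/{{ date }}.parquet"
```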
1 vote · 1 answer

Listing available drivers in intake

How can I list all the available drivers in intake? I attempted to run dir on intake.source, but didn't manage to find a listing of drivers.
SultanOrazbayev · 14,900
1 vote · 1 answer

can I define data filters with intake catalogs?

I would like to use intake to not only link to published datasets, but also filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but this means providing the user code beyond the metadata in order to give some…
Thomas · 23
1 vote · 2 answers

Intake: catalogue level parameters

I am reading about "parameters" here and wondering whether I can define catalogue-level parameters that I can later use in the definition of the catalogue's sources. Consider a simple YAML catalogue with two sources: sources: data1: args: …
Mikhail Shevelev · 408
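For context, a hedged sketch of the documented source-level parameters that the question hopes to hoist to catalogue level (paths and names are invented for illustration):

```yaml
# Hypothetical catalog: "version" is defined per source and templated
# into the urlpath; the question asks whether one such definition can
# be shared by several sources at the catalogue level instead.
sources:
  data1:
    driver: csv
    parameters:
      version:
        description: dataset version to load
        type: int
        default: 1
    args:
      urlpath: "/data/v{{ version }}/data1.csv"
```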
1 vote · 1 answer

Apache Arrow or feather plugin?

I'd like to use local feather files as sources in Intake. Does a plugin for feather/arrow not exist yet, or am I missing something?
bowlby · 649
1 vote · 1 answer

How to open a json file with Intake?

I'm trying to use intake to create a data catalogue for a JSON file. #197 mentions "Essentially, you need to provide the reader function json.loads, if each of your files is a single JSON block which evaluates to a list of objects." I created a…
rdmolony · 601
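The advice quoted from issue #197 amounts to using `json.loads` as the reader for files whose entire contents are one JSON block that evaluates to a list of objects; in plain Python (standard library only, sample data invented for illustration) the pattern is:

```python
import json

# A file whose whole contents are a single JSON block that evaluates
# to a list of objects, the case described in issue #197.
text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

records = json.loads(text)  # one call parses the whole block
print(records[0]["name"])
```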
1 vote · 1 answer

How to aggregate a large number of small csv files (~50k files, each 120kb) efficiently (code size, scheduler+cluster runtime) with dask?

I have a data set that contains one timeseries per file. I am really happy with how dask handles ~1k files on our cluster (one directory in my case). But I have around 50 directories. The funny thing that happens is that building the dask graph seems…
till · 570
1 vote · 1 answer

How can I add a customized method to a plugin that returns the data source not only in dask format, but also in several different custom formats?

I am working on an intake plugin that allows reading specific JSON files from GitHub. These JSON files contain basic information about systems that we want to simulate with different simulation software, each with its own input format. We have…
TMS · 21
1 vote · 1 answer

dataset view and access control in yaml file

I am new to intake and I am trying to understand how I can control the visibility and access rights for catalog entries. For example, I would like to find out what a catalog yaml file looks like for the following case: suppose I have two csv files to…
TMS · 21
1 vote · 1 answer

TLS-enabled communication between intake client and intake server

The official Intake documentation mentions that authorization plugins are classes that can be used to customize access permissions to the Intake catalog server. The Intake server and client communicate over HTTP, so when security is a…
alex · 61
1 vote · 1 answer

How to do always necessary pre processing / cleaning with intake?

I have a use case where: I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. don't follow community conventions enforced by some software further down the processing chain.) I cannot…
willirath · 63
1 vote · 1 answer

data persistence to original data source

Can anybody tell me whether the use case below makes sense and is applicable to the Intake software component? We would like to use intake to build an abstraction layer or API service endpoint that encapsulates typical data operations, such as data retrieval and data…
alex · 61
1 vote · 1 answer

Data source on GCP BigQuery

I tried to look for any existing intake components, such as a Driver or Plugin, that can support GCP BigQuery. If no such support exists, please advise on how to implement a subclass of intake.source.base.DataSource.
alex · 61