Questions tagged [dask-dataframe]

403 questions
6 votes, 1 answer

Read group of rows from Parquet file in Python Pandas / Dask?

I have a Pandas dataframe that looks similar to this:

    datetime                 data1  data2
    2021-01-23 00:00:31.140  a1     a2
    2021-01-23 00:00:31.140  b1     b2
    2021-01-23 00:00:31.140  c1     c2
    2021-01-23 00:01:29.021  d1     …
Mike • 155
6 votes, 2 answers

How to create unique index in Dask DataFrame?

Imagine I have a Dask DataFrame from read_csv or created in another way. How can I make a unique index for the Dask DataFrame? Note: reset_index builds a monotonically ascending index in each partition. That means (0, 1, 2, 3, 4, 5, ...) for Partition…
Spar • 463
5 votes, 1 answer

Apply a function over the columns of a Dask array

What is the most efficient way to apply a function to each column of a Dask array? As documented below, I've tried a number of things but I still suspect that my use of Dask is rather amateurish. I have a quite wide and quite long array, in the…
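One option for per-column functions is dask.array's apply_along_axis, which rechunks so each 1-d slice along the chosen axis is contiguous before calling the function. A sketch with a made-up function and array:

```python
import dask.array as da

# wide-ish demo array; real data would be much larger
x = da.random.random((1000, 4), chunks=(250, 2))

def peak_to_peak(col):
    # arbitrary 1-d -> scalar function applied to each column
    return col.max() - col.min()

# axis=0 applies the function down each column; dask rechunks internally
result = da.apply_along_axis(peak_to_peak, 0, x).compute()
```

For simple reductions (mean, sum, std), the built-in `x.mean(axis=0)` style reductions are cheaper since they avoid the rechunk.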
5 votes, 1 answer

Implement Equal-Width Intervals feature engineering in Dask

In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals. For example, given the…
ps0604 • 1,227
5 votes, 0 answers

Dask distributed KeyError

I am trying to learn Dask using a small example. Basically I read in a file and calculate row means.

    from dask_jobqueue import SLURMCluster
    cluster = SLURMCluster(cores=4, memory='24 GB')
    cluster.scale(4)
    from dask.distributed import Client
    client…
5 votes, 1 answer

Efficiently read big csv file by parts using Dask

Now I'm reading a big csv file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with an ML model and writing the results to a database). To avoid loading all the data into memory, I want to read it in chunks of a given size: read first…
Mikhail_Sam • 10,602
4 votes, 1 answer

Setting maximum number of workers in Dask map function

I have a Dask process that triggers 100 workers with a map function:

    worker_args = ....  # array with 100 elements with worker parameters
    futures = client.map(function_in_worker, worker_args)
    worker_responses = client.gather(futures)

I use docker…
ps0604 • 1,227
4 votes, 2 answers

Is there a way to traverse through a dask dataframe backwards?

I want to use read_parquet but read backwards from the starting point (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?
Anina Hitt • 61
4 votes, 1 answer

Get column value after searching for row in dask

I have a pandas dataframe that I converted to a Dask dataframe using the from_pandas function of Dask. It has 3 columns, namely col1, col2 and col3. Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…
Tanmay Bhatnagar • 2,330
4 votes, 1 answer

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to Dask dataframes. In this script, I am executing a merge of two dataframes on user-specified columns.

    def merge_dfs(df1, df2,…
Eliran Turgeman • 1,526
4 votes, 1 answer

Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'

I'm working with a Dask Cluster on GCP. I'm using this code to deploy it:

    from dask_cloudprovider.gcp import GCPCluster
    from dask.distributed import Client
    enviroment_vars = {
        'EXTRA_PIP_PACKAGES': '"gcsfs"'
    }
    cluster = GCPCluster( …
4 votes, 2 answers

How to read in a csv to a Dask dataframe so it will not have “Unnamed: 0” column?

Goal: I want to read in a csv to a Dask dataframe without getting an “Unnamed: 0” column.

Code:

    mydtype = {'col1': 'object', 'col2': 'object', 'col3': 'object', 'col4': 'float32',}
    do =…
sogu • 2,738
4 votes, 1 answer

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset, then group by a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that row. It seems to be working on small data and using…
Lostsoul • 25,013
4 votes, 0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using dask.dataframe.read_parquet, then apply a transformation to add columns using…
Michael Wheeler • 849
4 votes, 3 answers

Dask: convert a dask.DataFrame to an xarray.Dataset

This is possible in pandas. I would like to do it with dask. Edit: raised on dask here. FYI you can go from an xarray.Dataset to a Dask.DataFrame. Pandas solution using .to_xarray:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame([('falcon',…
Ray Bell • 1,508