Questions tagged [dask-dataframe]
403 questions
6 votes · 1 answer
Read group of rows from Parquet file in Python Pandas / Dask?
I have a Pandas dataframe that looks similar to this:
datetime data1 data2
2021-01-23 00:00:31.140 a1 a2
2021-01-23 00:00:31.140 b1 b2
2021-01-23 00:00:31.140 c1 c2
2021-01-23 00:01:29.021 d1 …

Mike · 155
6 votes · 2 answers
How to create unique index in Dask DataFrame?
Imagine I have a Dask DataFrame from read_csv or created another way.
How can I make a unique index for the dask dataframe?
Note:
reset_index builds a monotonically ascending index in each partition. That means (0,1,2,3,4,5,... ) for Partition…

Spar · 463
5 votes · 1 answer
Apply a function over the columns of a Dask array
What is the most efficient way to apply a function to each column of a Dask array? As documented below, I've tried a number of things but I still suspect that my use of Dask is rather amateurish.
I have a quite wide and quite long array, in the…

chameau13 · 626
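For a generic per-column function, one route is `dask.array.apply_along_axis`, sketched below on made-up data; for reductions dask already knows (mean, sum, std), the built-in `x.mean(axis=0)` is cheaper because it avoids rechunking the mapped axis:

```python
import numpy as np
import dask.array as da

x = da.random.random((1000, 8), chunks=(250, 8))

# apply_along_axis maps a 1-D function down each column (axis=0);
# dask rechunks so each column is contiguous before applying it
col_means = da.apply_along_axis(np.mean, 0, x).compute()
```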
5 votes · 1 answer
Implement Equal-Width Intervals feature engineering in Dask
In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals.
For example, given the…

ps0604 · 1,227
5 votes · 0 answers
Dask distributed KeyError
I am trying to learn Dask using a small example. Basically I read in a file and calculate row means.
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=4, memory='24 GB')
cluster.scale(4)
from dask.distributed import Client
client…

Phoenix Mu · 648
5 votes · 1 answer
Efficiently read big csv file by parts using Dask
I'm currently reading a big csv file using Dask and doing some postprocessing on it (for example, some math, then predicting with an ML model and writing the results to a database).
To avoid loading all the data into memory, I want to read it in chunks of a given size: read the first…

Mikhail_Sam · 10,602
4 votes · 1 answer
Setting maximum number of workers in Dask map function
I have a Dask process that triggers 100 workers with a map function:
worker_args = .... # array with 100 elements with worker parameters
futures = client.map(function_in_worker, worker_args)
worker_responses = client.gather(futures)
I use docker…

ps0604 · 1,227
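One technique for capping concurrency under `client.map` (a sketch, not the only mechanism; worker resources are another) is a distributed `Semaphore`, which limits how many mapped tasks run at once regardless of cluster size:

```python
from dask.distributed import Client, Semaphore

# a small local in-process cluster just to demonstrate
client = Client(processes=False, n_workers=1, threads_per_worker=4)

# a cluster-wide semaphore: at most max_leases tasks inside the
# guarded block at any moment
sem = Semaphore(max_leases=2, name="max_parallel")

def function_in_worker(arg):
    with sem:           # at most 2 tasks execute this body concurrently
        return arg * 2  # placeholder for the real work

futures = client.map(function_in_worker, range(8))
worker_responses = client.gather(futures)
client.close()
```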
4 votes · 2 answers
Is there a way to traverse through a dask dataframe backwards?
I want to read_parquet but read backwards from where you start (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?

Anina Hitt · 61
4 votes · 1 answer
Get column value after searching for row in dask
I have a pandas dataframe that I converted to a dask dataframe using the from_pandas function of dask. It has 3 columns namely col1, col2 and col3.
Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…

Tanmay Bhatnagar · 2,330
4 votes · 1 answer
Merging on columns with dask
I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.
def merge_dfs(df1, df2,…

Eliran Turgeman · 1,526
4 votes · 1 answer
Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'
I'm working with a Dask Cluster on GCP. I'm using this code to deploy it:
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import Client
environment_vars = {
'EXTRA_PIP_PACKAGES': '"gcsfs"'
}
cluster = GCPCluster(
…

Paula Vallejo · 43
4 votes · 2 answers
How to read a csv into a Dask dataframe so it will not have an “Unnamed: 0” column?
Goal
I want to read a csv into a Dask dataframe without getting an “Unnamed: 0” column.
CODE
mydtype = {'col1': 'object',
           'col2': 'object',
           'col3': 'object',
           'col4': 'float32',}
do =…

sogu · 2,738
4 votes · 1 answer
Dask crashing when saving to file?
I'm trying to one-hot encode a dataset, then group by a specific column so I get one row for each item in that column, with an aggregated view of which one-hot columns are true for that specific row. It seems to be working on small data and using…

Lostsoul · 25,013
4 votes · 0 answers
Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using the dask.dataframe.read_parquet, apply a transformation to add columns using…

Michael Wheeler · 849
4 votes · 3 answers
Dask: convert a dask.DataFrame to an xarray.Dataset
This is possible in pandas.
I would like to do it with dask.
Edit: raised on dask here
FYI you can go from an xarray.Dataset to a Dask.DataFrame
Pandas solution using .to_xarray:
import pandas as pd
import numpy as np
df = pd.DataFrame([('falcon',…

Ray Bell · 1,508
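For context, the pandas side the question refers to looks like this (a minimal sketch with made-up data; it requires xarray to be installed, since `DataFrame.to_xarray` delegates to it):

```python
import pandas as pd

# DataFrame.to_xarray returns an xarray.Dataset; the index
# becomes the Dataset's coordinate
df = pd.DataFrame(
    {"speed": [389.0, 24.0]},
    index=pd.Index(["falcon", "parrot"], name="animal"),
)
ds = df.to_xarray()
```

There is no direct `dask.DataFrame.to_xarray`; the linked dask issue tracks that gap, and in the meantime one must go through pandas (compute first) or build the `xarray.Dataset` from dask arrays.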