Questions tagged [dask-dataframe]

403 questions
6 votes, 1 answer

Read group of rows from Parquet file in Python Pandas / Dask?

I have a Pandas dataframe that looks similar to this:

    datetime                 data1  data2
    2021-01-23 00:00:31.140  a1     a2
    2021-01-23 00:00:31.140  b1     b2
    2021-01-23 00:00:31.140  c1     c2
    2021-01-23 00:01:29.021  d1     …
Mike • 155
6 votes, 2 answers

How to create unique index in Dask DataFrame?

Imagine I have a Dask DataFrame from read_csv or created in another way. How can I make a unique index for the Dask DataFrame? Note: reset_index builds a monotonically ascending index in each partition. That means (0, 1, 2, 3, 4, 5, ...) for Partition…
Spar • 463
5 votes, 1 answer

Apply a function over the columns of a Dask array

What is the most efficient way to apply a function to each column of a Dask array? As documented below, I've tried a number of things but I still suspect that my use of Dask is rather amateurish. I have a quite wide and quite long array, in the…
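One option for per-column functions is dask.array's apply_along_axis, which rechunks so each 1-d slice along the chosen axis is contiguous before calling the function. A sketch with a made-up function and array:

```python
import dask.array as da

# wide-ish demo array; real data would be much larger
x = da.random.random((1000, 4), chunks=(250, 2))

def peak_to_peak(col):
    # arbitrary 1-d -> scalar function applied to each column
    return col.max() - col.min()

# axis=0 applies the function down each column; dask rechunks internally
result = da.apply_along_axis(peak_to_peak, 0, x).compute()
```

For simple reductions (mean, sum, std), the built-in `x.mean(axis=0)` style reductions are cheaper since they avoid the rechunk.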
5 votes, 1 answer

Implement Equal-Width Intervals feature engineering in Dask

In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals. For example, given the…
ps0604 • 1,227
5 votes, 0 answers

Dask distributed KeyError

I am trying to learn Dask using a small example. Basically I read in a file and calculate row means.

    from dask_jobqueue import SLURMCluster
    cluster = SLURMCluster(cores=4, memory='24 GB')
    cluster.scale(4)
    from dask.distributed import Client
    client…
5 votes, 1 answer

Efficiently read big csv file by parts using Dask

Now I'm reading a big csv file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with an ML model and writing the results to a database). To avoid loading all the data into memory, I want to read it in chunks of a given size: read first…
Mikhail_Sam • 10,602
4 votes, 1 answer

Setting maximum number of workers in Dask map function

I have a Dask process that triggers 100 workers with a map function:

    worker_args = ....  # array with 100 elements with worker parameters
    futures = client.map(function_in_worker, worker_args)
    worker_responses = client.gather(futures)

I use docker…
ps0604 • 1,227
4 votes, 2 answers

Is there a way to traverse through a dask dataframe backwards?

I want to use read_parquet but read backwards from the starting point (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?
Anina Hitt • 61
4 votes, 1 answer

Get column value after searching for row in dask

I have a pandas dataframe that I converted to a Dask dataframe using the from_pandas function of Dask. It has 3 columns, namely col1, col2 and col3. Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…
Tanmay Bhatnagar • 2,330
4 votes, 1 answer

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to Dask dataframes. In this script, I am executing a merge of two dataframes on user-specified columns.

    def merge_dfs(df1, df2,…
Eliran Turgeman • 1,526
4 votes, 1 answer

Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'

I'm working with a Dask Cluster on GCP. I'm using this code to deploy it:

    from dask_cloudprovider.gcp import GCPCluster
    from dask.distributed import Client
    enviroment_vars = {
        'EXTRA_PIP_PACKAGES': '"gcsfs"'
    }
    cluster = GCPCluster( …
4 votes, 2 answers

How to read in a csv to a Dask dataframe so it will not have “Unnamed: 0” column?

Goal: I want to read in a csv to a Dask dataframe without getting an “Unnamed: 0” column.

Code:

    mydtype = {'col1': 'object', 'col2': 'object', 'col3': 'object', 'col4': 'float32',}
    do =…
sogu • 2,738
4 votes, 1 answer

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset, then group by a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that row. It seems to be working on small data and using…
Lostsoul • 25,013
4 votes, 0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using dask.dataframe.read_parquet, then apply a transformation to add columns using…
Michael Wheeler • 849
4 votes, 3 answers

Dask: convert a dask.DataFrame to an xarray.Dataset

This is possible in pandas. I would like to do it with dask. Edit: raised on dask here. FYI you can go from an xarray.Dataset to a Dask.DataFrame. Pandas solution using .to_xarray:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame([('falcon',…
Ray Bell • 1,508