Converting a DataFrame to an xarray.Dataset is possible in pandas.

I would like to do the same with Dask.

Edit: I raised this on Dask here.

FYI, you can go from an xarray.Dataset to a dask.dataframe.DataFrame with Dataset.to_dask_dataframe().
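For example, a minimal sketch of that reverse direction using the documented Dataset.to_dask_dataframe (the Dataset contents here are made up for illustration):

import numpy as np
import xarray as xr

# Build a small chunked Dataset, then convert it back to a dask DataFrame.
ds = xr.Dataset({"a": ("x", np.arange(10))}).chunk({"x": 5})
ddf = ds.to_dask_dataframe()  # a dask.dataframe.DataFrame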

Pandas solution using .to_xarray():

import pandas as pd
import numpy as np

df = pd.DataFrame([('falcon', 'bird', 389.0, 2),
                   ('parrot', 'bird', 24.0, 2),
                   ('lion', 'mammal', 80.5, 4),
                   ('monkey', 'mammal', np.nan, 4)],
                  columns=['name', 'class', 'max_speed',
                           'num_legs'])

df.to_xarray()
<xarray.Dataset>
Dimensions:    (index: 4)
Coordinates:
  * index      (index) int64 0 1 2 3
Data variables:
    name       (index) object 'falcon' 'parrot' 'lion' 'monkey'
    class      (index) object 'bird' 'bird' 'mammal' 'mammal'
    max_speed  (index) float64 389.0 24.0 80.5 nan
    num_legs   (index) int64 2 2 4 4

Dask solution?

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=1)

# ? (dask.dataframe has no .to_xarray method)

I could look at a solution using xarray, but I think it only has .from_dataframe.

import xarray as xr

# from_dataframe is a class method; note that .compute() loads the
# entire dask DataFrame into memory first, which defeats the purpose.
ds = xr.Dataset.from_dataframe(ddf.compute())
Ray Bell

3 Answers


So this is possible, and I've made a PR that achieves it: https://github.com/pydata/xarray/pull/4659

It provides two methods, Dataset.from_dask_dataframe and DataArray.from_dask_series.

The main reason it hasn't been merged yet is that we're trying to compute the chunk sizes with as few dask computations as possible.

There's some more context in these issues: https://github.com/pydata/xarray/issues/4650, https://github.com/pydata/xarray/issues/3929
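If the PR is merged, usage would presumably look like the sketch below. These method names come from the (unmerged) PR, not from a released xarray, so treat this as hypothetical:

import pandas as pd
import dask.dataframe as dd
import xarray as xr

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(df, npartitions=2)

# Hypothetical API from the PR -- not available in released xarray.
ds = xr.Dataset.from_dask_dataframe(ddf)
da = xr.DataArray.from_dask_series(ddf["a"])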

Ayrton Bourn

I was looking for something similar and created this function (it is not perfect, but it works pretty well). It also keeps all the data as dask arrays, which saves memory, etc.

import xarray as xr
import dask.dataframe as dd

def dask_2_xarray(ddf, indexname='index'):
    ds = xr.Dataset()
    # Use the dask DataFrame's index as the coordinate.
    ds[indexname] = ddf.index
    # Each column becomes a data variable backed by a dask array;
    # compute_chunk_sizes() fills in the unknown chunk sizes.
    for key in ddf.columns:
        ds[key] = (indexname, ddf[key].to_dask_array().compute_chunk_sizes())
    return ds

# use:
ds = dask_2_xarray(ddf)

Example:

path = "LOCATION TO FILE"  # path to the HDF5 file
ddf_test = dd.read_hdf(path, key="/data*", sorted_index=True, mode='r')
ds = dask_2_xarray(ddf_test, indexname="time")
ds

Result: (screenshot of the resulting Dataset displayed in JupyterLab)

Most of the time is spent computing the chunk sizes, so if somebody knows a better way to do that, this will be faster.
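One way to cut that cost (a sketch, not benchmarked; dask_2_xarray_fast is my own name for it) is to compute the partition lengths once with map_partitions(len) and pass them to every to_dask_array call via lengths=, instead of calling compute_chunk_sizes() per column:

import xarray as xr
import dask.dataframe as dd

def dask_2_xarray_fast(ddf, indexname='index'):
    # One pass over the data to get every partition's length.
    lengths = tuple(ddf.map_partitions(len).compute())
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        # Passing explicit lengths avoids a per-column computation.
        ds[key] = (indexname, ddf[key].to_dask_array(lengths=lengths))
    return ds

This trades one chunk-size computation per column for a single pass over the DataFrame.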

n4321d

This method doesn't currently exist. If you think it should, then I encourage you to raise a GitHub issue as a feature request. You might want to tag some xarray people, though.

MRocklin