Converting a DataFrame to an xarray.Dataset is possible in pandas.

I would like to do the same with Dask.

Edit: I raised this on Dask here.

FYI, you can go from an xarray.Dataset to a dask.dataframe.DataFrame with Dataset.to_dask_dataframe().
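For example, a minimal sketch of that reverse direction using the documented Dataset.to_dask_dataframe (the Dataset contents here are made up for illustration):

import numpy as np
import xarray as xr

# Build a small chunked Dataset, then convert it back to a dask DataFrame.
ds = xr.Dataset({"a": ("x", np.arange(10))}).chunk({"x": 5})
ddf = ds.to_dask_dataframe()  # a dask.dataframe.DataFrame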

Pandas solution using .to_xarray():

import pandas as pd
import numpy as np

df = pd.DataFrame([('falcon', 'bird', 389.0, 2),
                   ('parrot', 'bird', 24.0, 2),
                   ('lion', 'mammal', 80.5, 4),
                   ('monkey', 'mammal', np.nan, 4)],
                  columns=['name', 'class', 'max_speed',
                           'num_legs'])

df.to_xarray()
<xarray.Dataset>
Dimensions:    (index: 4)
Coordinates:
  * index      (index) int64 0 1 2 3
Data variables:
    name       (index) object 'falcon' 'parrot' 'lion' 'monkey'
    class      (index) object 'bird' 'bird' 'mammal' 'mammal'
    max_speed  (index) float64 389.0 24.0 80.5 nan
    num_legs   (index) int64 2 2 4 4

Dask solution?

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=1)

# ? (dask.dataframe has no .to_xarray method)

I could look at a solution using xarray, but I think it only has .from_dataframe.

import xarray as xr

# from_dataframe is a class method; note that .compute() loads the
# entire dask DataFrame into memory first, which defeats the purpose.
ds = xr.Dataset.from_dataframe(ddf.compute())
Ray Bell

3 Answers


So this is possible, and I've made a PR that achieves it: https://github.com/pydata/xarray/pull/4659

It provides two methods, Dataset.from_dask_dataframe and DataArray.from_dask_series.

The main reason it hasn't been merged yet is that we're trying to compute the chunk sizes with as few dask computations as possible.

There's some more context in these issues: https://github.com/pydata/xarray/issues/4650, https://github.com/pydata/xarray/issues/3929
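If the PR is merged, usage would presumably look like the sketch below. These method names come from the (unmerged) PR, not from a released xarray, so treat this as hypothetical:

import pandas as pd
import dask.dataframe as dd
import xarray as xr

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(df, npartitions=2)

# Hypothetical API from the PR -- not available in released xarray.
ds = xr.Dataset.from_dask_dataframe(ddf)
da = xr.DataArray.from_dask_series(ddf["a"])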

Ayrton Bourn

I was looking for something similar and created this function (it is not perfect, but it works pretty well). It also keeps all the data as dask arrays, which saves memory, etc.

import xarray as xr
import dask.dataframe as dd

def dask_2_xarray(ddf, indexname='index'):
    ds = xr.Dataset()
    # Use the dask DataFrame's index as the coordinate.
    ds[indexname] = ddf.index
    # Each column becomes a data variable backed by a dask array;
    # compute_chunk_sizes() fills in the unknown chunk sizes.
    for key in ddf.columns:
        ds[key] = (indexname, ddf[key].to_dask_array().compute_chunk_sizes())
    return ds

# use:
ds = dask_2_xarray(ddf)

Example:

path = "LOCATION TO FILE"  # path to the HDF5 file
ddf_test = dd.read_hdf(path, key="/data*", sorted_index=True, mode='r')
ds = dask_2_xarray(ddf_test, indexname="time")
ds

Result: (screenshot of the resulting Dataset displayed in JupyterLab)

Most of the time is spent computing the chunk sizes, so if somebody knows a better way to do that, this will be faster.
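One way to cut that cost (a sketch, not benchmarked; dask_2_xarray_fast is my own name for it) is to compute the partition lengths once with map_partitions(len) and pass them to every to_dask_array call via lengths=, instead of calling compute_chunk_sizes() per column:

import xarray as xr
import dask.dataframe as dd

def dask_2_xarray_fast(ddf, indexname='index'):
    # One pass over the data to get every partition's length.
    lengths = tuple(ddf.map_partitions(len).compute())
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        # Passing explicit lengths avoids a per-column computation.
        ds[key] = (indexname, ddf[key].to_dask_array(lengths=lengths))
    return ds

This trades one chunk-size computation per column for a single pass over the DataFrame.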

n4321d

This method doesn't currently exist. If you think it should, then I encourage you to raise a GitHub issue as a feature request. You might want to tag some xarray people, though.

MRocklin