If I have a Pandas dataframe, and a column that is a datetime type, I can get the year as follows:

df['year'] = df['date'].dt.year

With a dask dataframe, that does not work. If I compute first, like this:

df['year'] = df['date'].compute().dt.year

I get ValueError: Not all divisions are known, can't align partitions. Please use `set_index` or `set_partition` to set the index.

But if I do:

df['date'].head().dt.year

it works fine!

So how do I get the year (or week) of a datetime series in a dask dataframe?

user1566200
  • You might want to try creating a [minimal, complete, verifiable example](https://stackoverflow.com/help/mcve). – MRocklin Mar 15 '17 at 00:39

1 Answer

The `.dt` datetime namespace is present on Dask series objects. Here is a self-contained example of its use:

In [1]: import pandas as pd

In [2]: df = pd.util.testing.makeTimeSeries().to_frame().reset_index().head(10)

In [3]: df  # some pandas data to turn into a dask.dataframe
Out[3]: 
       index         0
0 2000-01-03 -0.034297
1 2000-01-04 -0.373816
2 2000-01-05 -0.844751
3 2000-01-06  0.924542
4 2000-01-07  0.507070
5 2000-01-10  0.216684
6 2000-01-11  1.191743
7 2000-01-12 -2.103547
8 2000-01-13  0.156629
9 2000-01-14  1.602243

In [4]: import dask.dataframe as dd

In [5]: ddf = dd.from_pandas(df, npartitions=3)

In [6]: ddf['year'] = ddf['index'].dt.year  # use the .dt namespace

In [7]: ddf.head()
Out[7]: 
       index         0  year
0 2000-01-03 -0.034297  2000
1 2000-01-04 -0.373816  2000
2 2000-01-05 -0.844751  2000
3 2000-01-06  0.924542  2000
MRocklin
  • Thanks! But what if my dataframe is already a dask dataframe, and I want to turn the date into year? I.e., in your example if `df` is already a Pandas dataframe. – user1566200 Mar 15 '17 at 11:27
  • In the example above we create a dask dataframe, `ddf`, from the pandas dataframe `df` using `dd.from_pandas`. – MRocklin Mar 15 '17 at 12:12
  • I think the comment raised the point that `ddf['year'] = df['index'].dt.year` should be `ddf['year'] = ddf['index'].dt.year`. Note the change between `df["index"]` and `ddf["index"]`. – Thomas Aug 27 '20 at 15:28