3

I'd like to create a DateTimeIndex in Pandas from broadcasted arrays of years, months, days, hours, etc. This is relatively straightforward to do via a list comprehension; e.g.

import numpy as np
import pandas as pd

def build_DatetimeIndex(*args):
    return pd.DatetimeIndex([pd.datetime(*tup)
                             for tup in np.broadcast(*args)])

For example:

>>> year = 2012
>>> months = [1, 2, 5, 6]
>>> days = [1, 15, 1, 15]
>>> build_DatetimeIndex(year, months, days)
DatetimeIndex(['2012-01-01', '2012-02-15', '2012-05-01', '2012-06-15'], 
              dtype='datetime64[ns]', freq=None)

But due to the list comprehension, this becomes rather slow as the size of the inputs grow. Is there a built-in way to do this in Pandas, or is there any way to define build_DatetimeIndex in terms of a fast, vectorized operation?

jakevdp
  • 77,104
  • 11
  • 125
  • 160
  • Your implementation looks very efficient. What makes you think the list comprehension slows it down? – Alexander Apr 22 '16 at 23:51
  • The main reason I suspect it's slow is in analogy to other vectorized operations. For example ``[np.add(*args) for args in zip(a, b)]`` is about 1000x slower than ``np.add(a, b)`` – jakevdp Apr 23 '16 at 03:04
  • @jakevdp https://github.com/pydata/pandas/pull/12967 here an example of some syntax that I think is reasonable to include in ``.to_datetime`` to do this assemblage. – Jeff Apr 23 '16 at 14:39
  • Thanks Jeff – confirms that there's currently no built-in function for what I want to do :) – jakevdp Apr 24 '16 at 18:56

4 Answers4

3

You can use dtypes m8[Y], m8[M], m8[D] to make Timedeltas arrays, and add them together to the date: "0000-01-01":

import pandas as pd
import numpy as np

year = np.arange(2010, 2020)
months = np.arange(1, 13)
days = np.arange(1, 29)

y, m, d = map(np.ravel, np.broadcast_arrays(*np.ix_(year, months, days)))

start = np.array(["0000-01-01"], dtype="M8[Y]")

r1 = start + y.astype("m8[Y]") + (m - 1).astype("m8[M]") + (d-1).astype("m8[D]")

def build_DatetimeIndex(*args):
    return pd.DatetimeIndex([pd.datetime(*tup)
                             for tup in np.broadcast(*args)])

r2 = build_DatetimeIndex(y, m, d)

np.all(pd.DatetimeIndex(r1) == r2)

To include hours , minutes, seconds:

import pandas as pd
import numpy as np

y = np.array([2012, 2013])
m = np.array([1, 3])
d = np.array([5, 20])
H = np.array([10, 20])
M = np.array([30, 40])
S = np.array([0, 30])

start = np.array(["0000-01-01"], dtype="M8[Y]")

date = start + y.astype("m8[Y]") + (m - 1).astype("m8[M]") + (d-1).astype("m8[D]")
datetime = date.astype("M8[s]") + H.astype("m8[h]") + M.astype("m8[m]") + S.astype("m8[s]")

pd.Series(datetime)

the result:

0   2012-01-05 10:30:00
1   2013-03-20 20:40:30
dtype: datetime64[ns]
HYRY
  • 94,853
  • 25
  • 187
  • 187
  • Thanks! To be honest, I was hoping there'd be a fast one-line built-in for this, but in lieu of that this is a really nice vectorized solution! – jakevdp Apr 24 '16 at 18:57
2

This is just to close the loop, and gives one example of the pd.to_datetime functionality that was developed as Jeff indicated in https://github.com/pydata/pandas/pull/12967.

pd.to_datetime will work with or without columns for years, months, days, etc. in a DataFrame. (See the Github discussion for examples with existing columns.)

Based on the example, the DatetimeIndex is being created without any existing columns with years, month, days, etc. in a DataFrame. This is possible.

import numpy as np
import pandas as pd

datedict = {'year':  [2012]*4, # Length must equal 'month' and 'day' length
            'month': [1, 2, 5, 6], 
            'day':   [1, 15, 1, 15]}
pd.DatetimeIndex(pd.to_datetime(datedict))
DatetimeIndex(['2012-01-01', '2012-02-15', '2012-05-01', '2012-06-15'], 
              dtype='datetime64[ns]', freq=None)
Nick Brown
  • 31
  • 1
  • 3
1

Solution

import numpy as np
import pandas as pd

def build_DatetimeIndex(years, months, days):
    years = pd.Index(years, name='year')
    months = pd.Index(months, name='month')
    days = pd.Index(days, name='day')

    panel = pd.Panel(items=days, major_axis=years, minor_axis=months)

    to_dt = lambda x: pd.datetime(*x)
    series = panel.fillna(0).to_frame().stack().index.to_series()

    return pd.DatetimeIndex(series.apply(to_dt))

Demonstration

dti = build_DatetimeIndex(range(1900, 2000), range(1, 13), [1, 15])

print dti

DatetimeIndex(['1900-01-01', '1900-01-15', '1900-02-01', '1900-02-15',
               '1900-03-01', '1900-03-15', '1900-04-01', '1900-04-15',
               '1900-05-01', '1900-05-15',
               ...
               '1999-08-01', '1999-08-15', '1999-09-01', '1999-09-15',
               '1999-10-01', '1999-10-15', '1999-11-01', '1999-11-15',
               '1999-12-01', '1999-12-15'],
              dtype='datetime64[ns]', length=2400, freq=None)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • How do you include hours and minutes, and have you timed your result against the OP's? – Alexander Apr 22 '16 at 23:51
  • @Alexander I'll time it now. Hours, minutes would take some modification. My mind was focused on years, months, days. Good questions though. Get's me thinking. – piRSquared Apr 22 '16 at 23:54
  • Can I trouble you to look at this...? https://stackoverflow.com/a/46190648/4909087 – cs95 Sep 13 '17 at 06:59
  • Feel free to trouble me as much as you'd like. Look'd good to me. – piRSquared Sep 13 '17 at 07:03
  • Haha thank you :-) I'll try to keep it to a minimum. One more thing... I wanted to thank you again for your email. It was very meaningful. I hope you received my response too. – cs95 Sep 13 '17 at 07:03
  • I did and you're welcome. Don't hesitate to reach out via email if you want as well. I'm often very busy but I'm happy to respond when I have a chance. And I'm not shy about reaching out myself If I need anything. – piRSquared Sep 13 '17 at 07:06
  • Can I get you to weigh in on this one? https://stackoverflow.com/a/46192213/2336654 – piRSquared Sep 13 '17 at 08:10
1

Another Solution

import pandas as pd
import numpy as np

def nao(*args):
    if len(args) == 1:
        return np.asarray(args[-1]).flatten()
    else:
        return np.add.outer(args[-1], nao(*args[:-1]) * 1e2).flatten()

def handler(*args):
    fmt = np.array(['%Y', '%m', '%d', '%H', '%M', '%S'])
    fstr = "".join(fmt[range(len(args))])
    ds = nao(*args).astype(np.dtype(int))
    return pd.Index(pd.Series(ds).apply(lambda x: pd.datetime.strptime(str(x), fstr)))

Demonstration

handler(range(1900, 2000), range(1, 13), range(1, 28))

DatetimeIndex(['1900-01-01', '1901-01-01', '1902-01-01', '1903-01-01',
               '1904-01-01', '1905-01-01', '1906-01-01', '1907-01-01',
               '1908-01-01', '1909-01-01',
               ...
               '1990-12-27', '1991-12-27', '1992-12-27', '1993-12-27',
               '1994-12-27', '1995-12-27', '1996-12-27', '1997-12-27',
               '1998-12-27', '1999-12-27'],
              dtype='datetime64[ns]', length=32400, freq=None)

Timed test

stamp = pd.datetime.now()
for _ in range (10):
    handler(range(1900, 2000), range(1, 13), range(1, 28))
print pd.datetime.now() - stamp

0:00:04.870000
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Interesting – but I wasn't necessarily aiming for an outer product. I just want to broadcast the inputs together. – jakevdp Apr 23 '16 at 03:07
  • 1
    Ah! I misunderstood. My other answer assumes an outer product as well. Question now is: do I delete it. – piRSquared Apr 23 '16 at 03:11