2

I need to create a categorical variable from a pandas DatetimeIndex and am looking for a Pythonic way to do it.

So far I have just looped through the whole index with a bunch of if-else statements. Taking inspiration from (Adding a new pandas column with mapped value from a dictionary), I tried building a dictionary of lambda if-else functions and passing it to map to create the categorical variable, but it didn't work:

date_series = pd.date_range(start = '2010-12-31', end = '2018-12-31', freq = 'M')

regime_splitter = {lambda x : x < '2012' : 'before 2012' , lambda x : x>= '2012' and x < '2014': '2012 - 2014', lambda x : x>= '2014' : 'after 2014'}

date_series.map(regime_splitter)

Expected result:

         date              regime
0  2010-12-31         before 2012
1  2013-05-31  between 2012, 2014
2  2018-12-31          after 2014
dgomzi

2 Answers

2

Use cut with DatetimeIndex.year. This solution also scales well if you need to add or remove groups later:

import numpy as np

# Bins are right-inclusive by default: (-inf, 2012], (2012, 2014], (2014, inf]
a = pd.cut(date_series.year,
           bins=[-np.inf, 2012, 2014, np.inf],
           labels=['before 2012', '2012 - 2014', 'after 2014'])
print(a.value_counts())
before 2012    25
2012 - 2014    24
after 2014     48
dtype: int64
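
If the groups should instead follow the boundaries written in the question (strictly before 2012, 2012 up to but not including 2014, 2014 onwards), one option is the `right=False` parameter of `pd.cut`, which closes each bin on the left. The following is only a minimal sketch on a handful of hand-picked dates, not the full range from the question:

```python
import numpy as np
import pandas as pd

# A few illustrative dates (an assumption; not the date_range from the question)
dates = pd.to_datetime(['2010-12-31', '2011-12-31', '2012-01-31',
                        '2013-05-31', '2014-01-31', '2018-12-31'])
years = pd.Series(dates.year)

labels = ['before 2012', '2012 - 2014', 'after 2014']
edges = [-np.inf, 2012, 2014, np.inf]

# right=False gives left-closed bins: [-inf, 2012), [2012, 2014), [2014, inf)
regime = pd.cut(years, bins=edges, labels=labels, right=False)
print(regime.tolist())
```

With the default `right=True`, the year 2012 lands in the first group and 2014 in the second, so which setting is appropriate depends on how the labels are meant to be read.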

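Another solution with numpy.select (note that the conditions here put the year 2014 into 'after 2014', so the counts differ from the cut version):

x = date_series.year
a = np.select([x <= 2012, x >= 2014], ['before 2012', 'after 2014'], '2012 - 2014')

print(pd.Series(a).value_counts())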
after 2014     60
before 2012    25
2012 - 2014    12
dtype: int64
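
The same `np.select` idea should also work on the timestamps directly, without going through `.year`, which may be useful when the break points are not at year boundaries. A minimal sketch, assuming cut-off dates of 2012-01-01 and 2014-01-01 (chosen here to mirror the conditions in the question) and a few hand-picked dates:

```python
import numpy as np
import pandas as pd

# A few illustrative dates (an assumption; not the date_range from the question)
dates = pd.to_datetime(['2010-12-31', '2012-01-31', '2013-05-31',
                        '2014-01-31', '2018-12-31'])

# Each condition is a boolean array; the first matching condition wins
conditions = [dates < pd.Timestamp('2012-01-01'),
              dates >= pd.Timestamp('2014-01-01')]
choices = ['before 2012', 'after 2014']

# Anything matching neither condition falls back to the default label
regime = np.select(conditions, choices, default='2012 - 2014')
print(regime.tolist())
```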

Your own approach can be made to work by using a single lambda with nested if-else instead of a dictionary of lambdas, but on large data it will be slow:

regime_splitter = (lambda x: 'before 2012' if x <= 2012 else
                             ('2012 - 2014' if x >= 2012 and x <= 2014 else 'after 2014'))

a = date_series.year.map(regime_splitter)
print(a.value_counts())
after 2014     48
before 2012    25
2012 - 2014    24
dtype: int64
jezrael
    Thanks, the first one fits my requirement perfectly. I am creating an interactive app where the user can send comma-separated timeseries_break_points and analyse different regimes, something like `may-2012,june-2013,aug-2014,sep-2016`. I am using an ipywidgets.Textarea to take the input and then applying `pd.cut` to create a new column, with `bins= [date_series.min()] + list(pd.to_datetime(regime_break_points.split(','))) + [date_series.max()]` – dgomzi Jul 09 '19 at 13:38
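
    For what it's worth, the datetime-bin approach described in this comment might look roughly like the sketch below. The dates and the break-point string are made up for illustration, and ISO-style strings are assumed so that parsing is unambiguous. One thing to watch for: with the default right-closed bins, the earliest date sits exactly on the first edge and is excluded unless `include_lowest=True` is passed.

```python
import pandas as pd

# Illustrative data (an assumption; not the date_range from the question)
s = pd.Series(pd.to_datetime(['2010-12-31', '2012-06-30', '2013-05-31',
                              '2014-09-30', '2018-12-31']))

# Hypothetical user input, e.g. the text typed into the Textarea widget
regime_break_points = '2012-05-01,2013-06-01,2014-08-01'
breaks = list(pd.to_datetime(regime_break_points.split(',')))

# Outer edges are taken from the data, as in the comment
bins = [s.min()] + breaks + [s.max()]

# include_lowest=True keeps s.min() inside the first bin instead of turning it into NaN
regime = pd.cut(s, bins=bins, include_lowest=True)

# Integer code of the bin each date falls into (-1 would mean "no bin")
codes = regime.cat.codes.tolist()
print(codes)
```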
1
import pandas as pd
data_series = pd.date_range(start='2010-12-31', end='2018-12-31', freq='M')
df = pd.DataFrame(data_series, columns=['Dates'])
 
def regime_splitter(value):
    if value < pd.to_datetime('2012-01-01'):
        return 'before 2012'
    elif value > pd.to_datetime('2014-12-31'):
        return 'After 2014'
    else:
        return 'Between 2012, 2014'
 
df['regime_splitter'] = df['Dates'].apply(regime_splitter)
 
df.head(15)
 
        Dates     regime_splitter
0  2010-12-31         before 2012
1  2011-01-31         before 2012
2  2011-02-28         before 2012
3  2011-03-31         before 2012
4  2011-04-30         before 2012
5  2011-05-31         before 2012
6  2011-06-30         before 2012
7  2011-07-31         before 2012
8  2011-08-31         before 2012
9  2011-09-30         before 2012
10 2011-10-31         before 2012
11 2011-11-30         before 2012
12 2011-12-31         before 2012
13 2012-01-31  Between 2012, 2014
14 2012-02-29  Between 2012, 2014