4

So I'm having a little trouble ordering my Dataframe, I've tried using this question, but haven't managed to get it to work. What I have is a Dataframe nudf like so:

                     date  level_1      0 
0     2016-10-01 00:00:00      0.0  74.00    
1     2016-10-01 00:30:00      0.5     72    
2     2016-10-01 01:00:00      1.0     70    
3     2016-10-01 01:30:00      1.5     64    
4     2016-10-01 02:00:00      2.0     63    
5     2016-10-01 02:30:00      2.5     60    
...                   ...      ...    ...   
19003 2017-09-31 21:30:00     21.5    129    
19004 2017-09-31 22:00:00     22.0    118    
19005 2017-09-31 22:30:00     22.5    106  
19006 2017-09-31 23:00:00     23.0     84    
19007 2017-09-31 23:30:00     23.5     76    

And what I would like to do is order the rows by an external month order:

[4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]

This is the last 12 months, from last month. I'd like to ignore the year and order each month block of rows in accordance with the order above.

For example, given the following rows:

0     2016-02-01 00:00:00      0.0  74.00    
1     2016-02-01 00:30:00      0.5     72    
2     2016-03-01 01:00:00      1.0     70    
3     2016-03-01 01:30:00      1.5     64    
4     2017-04-01 02:00:00      2.0     63    
5     2017-04-01 02:30:00      2.5     60  

The result should be:

4     2017-04-01 02:00:00      2.0     63    
5     2017-04-01 02:30:00      2.5     60
0     2016-02-01 00:00:00      0.0  74.00    
1     2016-02-01 00:30:00      0.5     72    
2     2016-03-01 01:00:00      1.0     70    
3     2016-03-01 01:30:00      1.5     64      

I've tried:

nudf['month'] = nudf.apply(lambda row: row.date.month, axis=1)
nudf.month = nudf.month.astype("category")
nudf.month.cat.set_categories([x.month for x in reversed(_get_last_x_months(12))], inplace=True)

nudf.sort_values(["month"], inplace=True)

But the day and hour order is not maintained.

Darkstarone
  • 4,590
  • 8
  • 37
  • 74

3 Answers3

3

You can use a separate categorical, argsort, and iloc
Also, notice I used kind='mergesort' because mergesort is a "stable" sorting algorithm and will maintain relative order for equivalently valued rows.

mcats = [4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]
nudf.iloc[pd.Categorical(nudf.date.dt.month, mcats, True).argsort(kind='mergesort')]

                 date  level_1     0
4 2017-04-01 02:00:00      2.0  63.0
5 2017-04-01 02:30:00      2.5  60.0
0 2016-02-01 00:00:00      0.0  74.0
1 2016-02-01 00:30:00      0.5  72.0
2 2016-03-01 01:00:00      1.0  70.0
3 2016-03-01 01:30:00      1.5  64.0

You can add the column as well

mcats = [4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]
nudf = nudf.assign(month=pd.Categorical(nudf.date.dt.month, mcats, True))
nudf.sort_values('month', kind='mergesort')

                 date  level_1     0 month
4 2017-04-01 02:00:00      2.0  63.0     4
5 2017-04-01 02:30:00      2.5  60.0     4
0 2016-02-01 00:00:00      0.0  74.0     2
1 2016-02-01 00:30:00      0.5  72.0     2
2 2016-03-01 01:00:00      1.0  70.0     3
3 2016-03-01 01:30:00      1.5  64.0     3

If we are trying to sort by month then by date, we don't need to specify a stable sort and just sort by two columns

mcats = [4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]
nudf = nudf.assign(month=pd.Categorical(nudf.date.dt.month, mcats, True))
nudf.sort_values(['month', 'date'])

Or instead of the argsort answer, we can use np.lexsort to return an ordering permutation based off of multiple arrays.

mcats = [4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]
nudf.iloc[np.lexsort(
    [nudf.date, pd.Categorical(nudf.date.dt.month, mcats, True)]
)]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Another one of those handy pandas things I never knew about. very interesting answer! I was wondering how the order worked... i'll give it a test – MattR Apr 06 '18 at 15:26
  • As you can see [here](https://gist.github.com/lutrasdebtra/4d811cd8baffd1cef9243e6dd907dd3b), the order is not correct for the larger file I have. – Darkstarone Apr 06 '18 at 15:26
  • I altered my post to specify a stable sorting algorithm. See if that helps. – piRSquared Apr 06 '18 at 15:28
  • Works perfectly with the mergesort addition. Thanks so much. – Darkstarone Apr 06 '18 at 15:37
1

You can accomplish the ordering by using %. But to get your desired output, first sort by datetime.

nudf.sort_values(by='date', inplace=True)

mcats = [x.month for x in reversed(_get_last_x_months(12))]
#[4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3]
nudf['m_fake'] = (nudf.date.dt.month+(12-mcats[0]))%12
nudf.sort_values(by='m_fake')

#                 date  val1  val2  m_fake
#4 2017-04-01 02:00:00   2.0    63       0
#5 2017-04-01 02:30:00   2.5    60       0
#0 2016-02-01 00:00:00   0.0    74       10
#1 2016-02-01 00:30:00   0.5    72       10
#2 2016-03-01 01:00:00   1.0    70       11
#3 2016-03-01 01:30:00   1.5    64       11
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • The ordering is generated by `[x.month for x in reversed(_get_last_x_months(12))]`, so I'd ideally like to use that. – Darkstarone Apr 06 '18 at 15:21
  • okay, then you'd just need to do 12-order[0] to never have to worry about it again – ALollz Apr 06 '18 at 15:23
  • As can see [here](https://gist.github.com/lutrasdebtra/46a0c1ac47ed12cdff10630da456c4de), on the larger dataset, some days are out of order. – Darkstarone Apr 06 '18 at 15:29
  • That's because in the large file they aren't sorted by date, while in the example they were. I just updated to sort first based on date and then apply the month ordering. That should do it. – ALollz Apr 06 '18 at 15:31
  • Yeah that was my mistake, your answer does also appear to work - thanks for your help, the other answer came in first, but this is also a good solution. – Darkstarone Apr 06 '18 at 15:38
1

you can use map to change the value and re-order the columns

# creates an int value based on the date using .dt.month (must be a date type)
df['month_value'] = df['date'].dt.month

# creates a dictionary that will remap the values
new_order = {4:1, 5:2, 6:3, 7:4, 8:5, 9:6, 10:7, 11:8, 12:9, 1:10, 2:11, 3:12}

# creates a new column based on the mapping
df['new_value'] = df['month_value'].map(new_order)

# sorts the values based on the new column
df.sort_values(by='new_value')

           date  month_value  new_value
4 2017-04-01            4          1
5 2017-04-01            4          1
0 2016-02-01            2         11
1 2016-02-01            2         11
2 2016-03-01            3         12
3 2016-03-01            3         12
MattR
  • 4,887
  • 9
  • 40
  • 67