-3

I want to do 2 things:

  1. I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).

  2. Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.

Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):

Date                                        
2016-01-02  NumericIndex                        1611664
            OriginMovementID                       4744
            DestinationMovementID                  5084
            MeanTravelTimeSeconds                  1233
            RangeLowerBoundTravelTimeSeconds        756
                                                 ...   
2020-03-31  DestinationMovementID                  3594
            MeanTravelTimeSeconds                  1778
            RangeLowerBoundTravelTimeSeconds       1601
            RangeUpperBoundTravelTimeSeconds       1973
            DayOfWeek                           Tuesday
Length: 11281655, dtype: object

When I use seaborn to plot the boxplot I guet a bucnh of errors after playing with different selections.

If I try to do df.stack().unstack() or df.stack().T I get then following error:

Index contains duplicate entries, cannot reshape

How do I plot the boxplot and how do I turn the rows into columns?

Luiz Scheuer
  • 305
  • 1
  • 10
  • I don't understand how you ended up with a dataframe shaped like this. Could you provide mockup data, in particular check out [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Diziet Asahi Aug 26 '20 at 14:35
  • What I had initially was a normal df with the "automatically generated" numeric index. I had a column called 'Date' where each row was a datetime object and each date was repeated multiple times (repeated for however many times there were values in other columns equivalent to that date). How I got to the dataframe you asked about: What I did afterwards, in order: 1) df.set_index(['Date']) 2) df.stack(). Each of the rows you see were my original columns. I want to have them back as columns, but leave the dates on the left the way they are right now. Having one single date for many values. – Luiz Scheuer Aug 26 '20 at 16:01

1 Answers1

1

You really do need to make your index unique to make the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.

import datetime as dt
import random
import numpy as np
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
 "RangeLowerBoundTravelTimeSeconds"]

df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)], 
  "Value":random.randint(1000,10000)} 
 for i in range(random.randint(5,20)) 
 for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])

# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])

# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
    if i==0 or p!=v: s=0
    else: s+=1
    seq[i] = s
    p=v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())

output enter image description here

                                 Value                                                                                     
Observation      DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date       SeqNo                                                                                                           
2016-01-02 0                       NaN                   NaN       2560.0           5324.0                           5085.0
           1                       NaN                   NaN       1066.0           7372.0                              NaN
2016-01-16 0                       NaN                6226.0          NaN           7832.0                              NaN
           1                       NaN                1384.0          NaN           8839.0                              NaN
           2                       NaN                7892.0          NaN              NaN                              NaN
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • This is amazing, I can't thank you enough. I will read through it carefully tomorrow, but a quick read through already makes me thing this is it. Cheers! – Luiz Scheuer Aug 26 '20 at 21:35