I want to create a stacked histogram like the one below.

[Image: example of the desired stacked histogram]

Here is my code:

import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt


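def stackhist(x, y):
    # group the values in x by the labels in y
    # (newer pandas versions no longer provide a top-level pd.groupby)
    grouped = x.groupby(y)
    data = [d for _, d in grouped]
    labels = [l for l, _ in grouped]
    plt.figure(figsize=(20, 10))
    # stacked takes a boolean, not the string "True"
    plt.hist(data, histtype="bar", stacked=True, label=labels)
    plt.legend()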

# make data distribution
mu, sigma = 12.2, 1.2
distribution = np.random.normal(mu, sigma, 200)

times = [(datetime.time(hour=int(x), minute=int((x - int(x))*60.0), second=int(((x - int(x)) * 60 - int((x - int(x))*60.0))*60.0))).strftime('%H:%M:%S') for x in distribution]

df = pd.DataFrame(columns=['time', 'department'])
df.time = times

df['department'] = df['department'].fillna(pd.Series(np.random.choice(['Shoes', 'Hats', 'Shirts', 'Pants'],
                                                                      p=[0.1, 0.15, 0.375, 0.375], size=len(df))))

stackhist(df['time'], df['department'])

plt.show()

Here is the output. Notice that the x-axis labels are all the different times stacked on top of each other. How can I make the axis show just the hours, as in 10-11-12-13-14-15-16, and not the minutes?

I noticed the 'hats' typo

Thank you for your attention.

ak_slick
mrbTT
  • Please supply the data frame so we don't have to remake it. Also, please supply an attempt. You have already put so much effort into this question. It would be cleaner to supply the code you are using as a full script rather than links to the examples you used, since the links don't show us exactly how you applied them and leave us to rebuild everything. – ak_slick Jul 25 '18 at 00:17
  • The thing is, I failed so much on that histogram that I don't know where to start... I'll try with your answer below. – mrbTT Jul 25 '18 at 12:52

1 Answer

Your first real issue here is that your times are not datetime.time objects. Because you call strftime on them, you end up with time strings, which matplotlib treats as categorical values, so each distinct time gets its own tick instead of being binned by hour.

This demonstrates how to fix your times and gets you the plot below.

Let me know if this makes sense.

import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt


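def stackhist(x, y):
    # group the values in x by the labels in y
    # (newer pandas versions no longer provide a top-level pd.groupby)
    grouped = x.groupby(y)
    data = [d for _, d in grouped]
    labels = [l for l, _ in grouped]
    plt.figure(figsize=(20, 10))
    # stacked takes a boolean, not the string "True"
    plt.hist(data, histtype="bar", stacked=True, label=labels)
    plt.legend()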


mu, sigma = 12.2, 1.2
distribution = np.random.normal(mu, sigma, 1000)

# only pull the hour from the datetime time
times = [(datetime.time(hour=int(x), minute=int((x - int(x))*60.0), second=int(((x - int(x)) * 60 - int((x - int(x))*60.0))*60.0))).strftime('%H') for x in distribution]

# make data frame since you used one
df = pd.DataFrame(columns=['time', 'department'])
df.time = times

# set times to integer instead of string so they will sort automatically
df['time'] = df['time'].astype(int)

# fill department data
df['department'] = df['department'].fillna(pd.Series(np.random.choice(['Shoes', 'Hats', 'Shirts', 'Pants'],
                                                                      p=[0.1, 0.15, 0.375, 0.375], size=len(df))))

stackhist(df['time'], df['department'])

plt.show()
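
As a variation on the same idea, here is a minimal sketch that takes the hour directly by truncating the sampled values and passes explicit bin edges to the histogram so each bar covers exactly one hour. The random seed, the `bins` array, and the use of `ax.set_xticks` are assumptions added for illustration and are not part of the code above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: a fixed seed, only so the sketch is reproducible.
rng = np.random.default_rng(0)

# Truncate the sampled values to whole hours (no datetime round trip).
hours = rng.normal(12.2, 1.2, 1000).astype(int)
departments = rng.choice(['Shoes', 'Hats', 'Shirts', 'Pants'],
                         p=[0.1, 0.15, 0.375, 0.375], size=hours.size)
df = pd.DataFrame({'time': hours, 'department': departments})

# One array of hours per department, in the group order pandas returns.
grouped = df['time'].groupby(df['department'])
labels = [name for name, _ in grouped]
data = [values.to_numpy() for _, values in grouped]

# Bin edges on whole hours, so every integer hour gets its own bin.
bins = np.arange(df['time'].min(), df['time'].max() + 2)

fig, ax = plt.subplots(figsize=(20, 10))
counts, edges, _ = ax.hist(data, bins=bins, histtype="bar",
                           stacked=True, label=labels)
ax.set_xticks(bins[:-1])  # one tick per hour
ax.legend()
```

Call `plt.show()` afterwards to display the figure. With `stacked=True`, the last array in `counts` holds the total height of each stacked bar.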

[Image: resulting stacked histogram with hours on the x-axis]

ak_slick
  • Thank you for your input. So, printing the second time without the strftime shows me "[datetime.time(10, 1, 20), datetime.time(10, 2, 36) ..." instead of "10:1:20, 10:2:36". Is this correct? – mrbTT Jul 25 '18 at 12:51
  • Can't edit the comment above, so I'll add another... After converting to datetime, I added the array to a DataFrame with `df = pd.DataFrame(data=times, columns=["arrival"])`. `df.info()` reports the column as object. Is this right? I can't find a data type specific to hours, minutes and seconds only... – mrbTT Jul 25 '18 at 13:02
  • I had to revert this change because when I ran the code I added to my question, it raised the error `ValueError: microsecond must be in 0..999999 python`, which doesn't occur with the `strftime`. – mrbTT Jul 25 '18 at 13:36
  • Hey, first of all, I'll mark this as the answer, thank you very much. Second, I've modified your solution: instead of converting the time to the hour only, I removed the seconds, converted the ":" to "." and cast it as float, so that the transition along the x-axis is smoother. Here is the code: `df['arrivalfloat'] = grafico['arrival'].str[:-3].str.replace(":",".").astype(float)` – mrbTT Jul 25 '18 at 14:28
  • Excellent. If you want to optimize further, take only the hours: `int(x)` skips all the datetime stuff. – ak_slick Jul 25 '18 at 14:34
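
One caveat with replacing ":" by "." is that the minutes are read as hundredths of an hour, so 12:30 lands at 12.30 rather than 12.5. A minimal sketch of a conversion to true fractional hours is shown here, assuming the times are `HH:MM:SS` strings as in the question; the `arrival` sample values and the `decimal_hours` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of arrival times as 'HH:MM:SS' strings.
arrival = pd.Series(['10:01:20', '12:30:00', '13:45:00'])

# Parse each string as a duration since midnight, then express it in hours.
decimal_hours = pd.to_timedelta(arrival).dt.total_seconds() / 3600
```

The resulting float column can be passed to the histogram in place of the string column, and whole-hour bin edges then give one bar per hour.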