jupyter notebook 5.2.2
Python 3.6.4
pandas 0.22.0
matplotlib 2.2.2

Hi, I'm trying to present and format a histogram in a Jupyter notebook, based on hour-and-minute log data retrieved from a Hadoop store using Hive SQL.

I'm having problems with the presentation. I'd like to set the x-axis to run from 00:00 to 23:59, with each bin starting at zero seconds of its minute and ending at the next minute, and with half-hourly tick marks. I just can't see how to do it.

The following pulls back 2 years of data as 1440 rows, one per minute of the day, each with the total count of events at that minute.

%%sql -o jondat
select eventtime, count(1) as cnt
from logs.eventlogs
group by eventtime

The data is stored as a string in hh:mm (hour and minute) format. However, the notebook appears to auto-convert it to a timestamp made of the current date plus the time. I have been experimenting with the data in this format and others.

If I strip out the colons I get

df.dtypes

eventtime int64
cnt int64

and if I use a dummy filler like a pipe I get

eventtime object
cnt int64

If I leave the colons in I get

eventtime datetime64
cnt int64

which is what I am currently using.

...
2018-11-22 00:27:00 32140
2018-11-22 00:28:00 32119
2018-11-22 00:29:00 31726
...
2018-11-22 23:30:00 47989
2018-11-22 23:31:00 40019
2018-11-22 23:32:00 40962
...
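Whichever dtype the notebook hands back, the time of day can be normalised explicitly before plotting. The following is only a sketch on made-up rows: the `sample` frame and the `minute_of_day` column are hypothetical names, not part of the real query result. It parses hh:mm strings with an explicit format and derives an integer minute of day, which does not depend on whatever date gets attached.

```python
import pandas as pd

# Hypothetical stand-in for the Hive result: 'HH:MM' strings plus counts.
sample = pd.DataFrame({
    "eventtime": ["00:27", "00:28", "23:30"],
    "cnt": [32140, 32119, 47989],
})

# Parse with an explicit format; pandas fills in a default date (1900-01-01).
parsed = pd.to_datetime(sample["eventtime"], format="%H:%M")

# Minutes since midnight: a plain integer in 0..1439, independent of the date.
sample["minute_of_day"] = parsed.dt.hour * 60 + parsed.dt.minute

print(sample.dtypes)
print(sample)
```

If the column has already arrived as datetime64, the `pd.to_datetime` step can be skipped and the `.dt.hour` / `.dt.minute` arithmetic applied to the column directly.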

I can then plot the data

%%local

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
import matplotlib.dates as md

xtformat = md.DateFormatter('%H:%M')

plt.rcParams['figure.figsize'] = [15,10]
df = pd.DataFrame(jondat)

x=df['eventtime']
b=144
y=df['cnt']

fig, ax=plt.subplots()

ax.xaxis_date()

ax.hist(x,b,weights=y)
ax.xaxis.set_major_formatter(xtformat)

plt.show()

Currently my axes extend well before and after the data, and the bins are centred over the minute, which becomes more of a pain if I change the number of bins. I can't see where to stop the auto-conversion from string to datetime, and I'm not sure whether I need to in order to get the result I want.

Is this about formatting my eventtime and setting the axes, or can I just set the axes easily irrespective of the data type? Ideally the labelled ticks would be user friendly.

This is the chart I get with 144 bins. As some of the log records are entered manually, the 1440-bin chart is "hairy" because manual records tend to be rounded. One of the things I am experimenting with is different bin counts.
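One way to experiment with bin counts while keeping the bin boundaries on exact minute marks is to pass explicit bin edges rather than a bin count. The sketch below assumes the time has been reduced to an integer minute of day (0..1439); the placeholder counts, `bin_width`, and the `minute_label` helper are all hypothetical and only there for illustration.

```python
import numpy as np

# Placeholder data: one entry per minute of the day, each with a count of 1.
minutes = np.arange(1440)
counts = np.ones(1440, dtype=int)

# Explicit edges at 0, 10, 20, ..., 1440 so every bin starts on a minute mark.
bin_width = 10  # minutes per bin; 10 gives 144 bins
edges = np.arange(0, 1440 + bin_width, bin_width)

# Weighted histogram: sums the per-minute counts falling inside each bin.
totals, _ = np.histogram(minutes, bins=edges, weights=counts)


def minute_label(m):
    """Format a minute-of-day integer as HH:MM (hypothetical helper)."""
    m = int(m)
    return f"{(m // 60) % 24:02d}:{m % 60:02d}"


# Label each bin by its left edge.
labels = [minute_label(e) for e in edges[:-1]]
print(len(totals), labels[0], labels[-1])
```

The same `edges` array can be passed as the `bins` argument of `ax.hist`, which accepts a sequence of edges as well as a count, so changing `bin_width` changes the resolution without shifting where the bins start.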

j_4321
JonB65
  • It looks to me that the data is already binned. So instead of `plt.hist` you might want a bar plot like `plt.bar(x,y, width=1/24/60, align="edge")`. Then the remaining question is how to get the half-hour tickmarks, which would be done as shown e.g. in [this question](https://stackoverflow.com/questions/42398264/matplotlib-xticks-every-15-minutes-starting-on-the-hour). – ImportanceOfBeingErnest Nov 23 '18 at 13:09
  • I've managed to get the HH:MM formatted using question in your link but I still get object of type MinuteLocator has no len(). I've added an example chart. – JonB65 Nov 26 '18 at 10:24
  • The code in the question has no bar plot and no MinuteLocator in it. Hence one cannot find out why you get this error. – ImportanceOfBeingErnest Nov 26 '18 at 19:41
  • I've edited the code to show what works, so it now matches the chart. The missing lines that fail are – JonB65 Nov 27 '18 at 16:25
  • Sorry, I've fixed the has no len() issue although I'm not clear what was wrong, I've edited the code so it now matches the chart. I've also got the tick intervals working by adding xlocator = md.MinuteLocator(byminute=[0], interval = 1) then ax.xaxis.set_major_locator(xtinter) and I can add minor ticks on the 30 minute adapting the same code. So thank you for your help and the link. The only part left is minor aesthetics, to make the chart start at 00:00 and end at 00:00, losing the space at either end. – JonB65 Nov 27 '18 at 16:38

1 Answer


Thanks to https://stackoverflow.com/users/4124317/importanceofbeingernest who gave me enough clues to find the answer.

%%local

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
import matplotlib.dates as md

plt.rcParams['figure.figsize'] = [15,10]
df = pd.DataFrame(jondat)

xtformat = md.DateFormatter('%H:%M')
xtinter = md.MinuteLocator(byminute=[0], interval=1)
xtmin = md.MinuteLocator(byminute=[30], interval=1)


x=df['eventtime']
b=144
y=df['cnt']

fig, ax=plt.subplots()

ld=min(df['eventtime'])
hd=max(df['eventtime'])

ax.xaxis_date()

ax.hist(x,b,weights=y)
ax.xaxis.set_major_formatter(xtformat)
ax.xaxis.set_major_locator(xtinter)
ax.xaxis.set_minor_locator(xtmin)
ax.set_xlim([ld,hd])

plt.show()

This lets me plot the chart tidily and play with the bin setting to see how much it affects the curve, both for presentation on a dashboard and to help think about categorizing events into time bands for analysis of event types by time.
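
As a possible variation, the limits can be pinned from midnight to the following midnight instead of to the minimum and maximum of the data, so the axis covers the whole day and ends on a round tick. This is a sketch under assumptions: the `times` index is synthetic, and `start` and `end` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.dates as md
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the eventtime column: every minute of one day.
times = pd.date_range("2018-11-22 00:00", periods=1440, freq="min")

# Midnight of the data's date, and midnight of the following day.
start = times.min().normalize()
end = start + pd.Timedelta(days=1)

fig, ax = plt.subplots()
ax.set_xlim(md.date2num(start.to_pydatetime()), md.date2num(end.to_pydatetime()))
ax.xaxis_date()

# Major ticks on the hour, minor ticks on the half hour, labels as HH:MM.
ax.xaxis.set_major_locator(md.HourLocator())
ax.xaxis.set_minor_locator(md.MinuteLocator(byminute=[30]))
ax.xaxis.set_major_formatter(md.DateFormatter("%H:%M"))

lo, hi = ax.get_xlim()
print(hi - lo)  # span of the axis in days
```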

JohanC
JonB65