
I'm trying to parse a log file (specifically, from a Gradle build), which looks like this:

21:51:38.991 [DEBUG] [TestEventLogger] cha.LoginTest4 STARTED
21:51:39.054 [DEBUG] [TestEventLogger] cha.LoginTest2 STARTED
21:51:40.068 [DEBUG] [TestEventLogger] cha.LoginTest4 PASSED
21:51:40.101 [DEBUG] [TestEventLogger] cha.LoginTest2 PASSED
21:51:40.366 [DEBUG] [TestEventLogger] cha.LoginTest1 STARTED
21:51:40.413 [DEBUG] [TestEventLogger] cha.LoginTest3 STARTED
21:51:50.435 [DEBUG] [TestEventLogger] cha.LoginTest1 PASSED
21:51:50.463 [DEBUG] [TestEventLogger] cha.LoginTest3 PASSED
21:51:50.484 [DEBUG] [TestEventLogger] Gradle Test Run :test PASSED
21:51:38.622 [DEBUG] [TestEventLogger] Gradle Test Run :test STARTED

into a chart that shows a timeline of events. Sort of like this:

n |  ======= 
a |   === 
m |       == 
e |    ======= 
  |______________
     time

So far I have parsed the logs and put the relevant "events" into a Pandas dataframe (sorted by timestamp).

log events parsed, sorted and ungrouped:
                 timestamp            name
0 1900-01-01 21:51:38.622            test
0 1900-01-01 21:51:38.991  cha.LoginTest4
0 1900-01-01 21:51:39.054  cha.LoginTest2
0 1900-01-01 21:51:40.068  cha.LoginTest4
0 1900-01-01 21:51:40.101  cha.LoginTest2
0 1900-01-01 21:51:40.366  cha.LoginTest1
0 1900-01-01 21:51:40.413  cha.LoginTest3
0 1900-01-01 21:51:50.435  cha.LoginTest1
0 1900-01-01 21:51:50.463  cha.LoginTest3
0 1900-01-01 21:51:50.484            test

Because I need the start and end time for each "name", I do a groupby. I get groups that look like this:

group                 timestamp            name
0       1900-01-01 21:51:38.991  cha.LoginTest4
0       1900-01-01 21:51:40.068  cha.LoginTest4

There are always two rows: the first row is the start time and the second is the end time. I was able to get as far as using hlines to show the timeline for each group. However, I'd like to get all groups into the same plot to see when they start and end in relation to each other. I'd still like to use groupby, since it gives me the start/end times together with the "name" in a few lines of code.
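For reference, a minimal, self-contained sketch of that groupby step (sample values taken from the dataframe above; the real script builds the full dataframe from the log):

import pandas as pd

# a few of the parsed rows from above (already sorted by timestamp)
sample = pd.DataFrame(
    [('21:51:38.622', 'test'),
     ('21:51:38.991', 'cha.LoginTest4'),
     ('21:51:40.068', 'cha.LoginTest4'),
     ('21:51:50.484', 'test')],
    columns=['timestamp', 'name'])
sample['timestamp'] = pd.to_datetime(sample['timestamp'], format='%H:%M:%S.%f')

# one row per name holding its first (start) and last (end) timestamp
spans = sample.groupby('name', sort=False)['timestamp'].agg(['min', 'max'])
print(spans)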

So far I've only managed to show a separate plot for each group without errors, not all of them in one figure. Here is what I did to show each plot:

for name, group in df.groupby('name', sort=False):

    group.amin = group['timestamp'].iloc[0] # assume sorted order
    group.amax = group['timestamp'].iloc[1]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.xaxis_date()
    ax.hlines(group.index, dt.date2num(group.amin), dt.date2num(group.amax))

    plt.show()

Solved. Full source:

import re
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
import warnings
from random import random

warnings.simplefilter(action='ignore', category=FutureWarning) # https://stackoverflow.com/a/46721064

'''
The log contents are not guaranteed to be in order. Multiple processes are dumping contents into a single file.
Contents from a single process will be in order.
'''

def main():

    log_file_path = "gradle-4.2.test.debug.log"

    # regex to get test and task log events
    test_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] (\S+[^:>]) (STARTED|PASSED|FAILED)$')
    task_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] Gradle Test Run [:](\S+) (STARTED|PASSED|FAILED)$')

    df = pd.DataFrame()
    with open(log_file_path, "r") as file:
        for line in file:
            test_match = test_re.findall(line)
            if test_match:
                df = df.append(test_match)
            else:
                task_match = task_re.findall(line)
                if task_match:
                    df = df.append(task_match)


    df.columns = ['timestamp','name','type']
    df.drop('type', axis=1, inplace=True) # don't need this col
    df['timestamp'] = pd.to_datetime(df.timestamp, format="%H:%M:%S.%f") # pandas datetime
    df = df.sort_values('timestamp')  # sort by pandas datetime

    print("log events parsed, sorted and ungrouped:\n", df)

    fig, ax = plt.subplots()
    ax.xaxis_date()

    # Customize the major grid
    ax.minorticks_on()
    ax.grid(which='major', linestyle='-', linewidth='0.2', color='gray')

    i = 0 # y-coord will be loop iteration

    # Group by name. Because the df was previously sorted, each group's rows are in order (start event, end event).
    # Give each group an hline.
    for name, group in df.groupby('name', sort=False):
        i += 1

        assert group['timestamp'].size == 2 # make sure we have a start & end time for each test/task
        group.amin = group['timestamp'].iloc[0] # assume sorted order
        group.amax = group['timestamp'].iloc[1]
        assert group.amin < group.amax # make sure start/end times are in order

        if '.' in name: # assume '.' indicates a JUnit test, not a task
            color = [(random(),random(),random())]
            linestyle = 'solid'
            ax.text(group.amin, (i + 0.05), name, color='blue') # add name to x, y+.05 to hline
        else: # a task.
            color = 'black'
            linestyle = 'dashed'
            ax.text(group.amin, (i + 0.05), name + ' (Task)', color='red') # add name to x, y+.05 to hline

        ax.hlines(i, dt.date2num(group.amin), dt.date2num(group.amax), linewidth = 6, color=color, linestyle=linestyle)

    # Turn off y ticks. These are just execution order (numbers won't make sense).
    plt.setp(ax.get_yticklabels(), visible=False)
    ax.yaxis.set_tick_params(size=0)
    ax.yaxis.tick_left()

    plt.title('Timeline of Gradle Task and Test Execution')
    plt.xlabel('Time')
    plt.ylabel('Execution Order')
    plt.show()
#    plt.savefig('myfig')


if __name__ == '__main__':
    main()
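Note: DataFrame.append has since been removed in pandas 2.0, so on a newer pandas the same frame can be built by collecting the regex matches in a list and constructing the DataFrame once. A minimal sketch of that variant, reusing log_file_path, test_re and task_re from the script above:

rows = []
with open(log_file_path, 'r') as file:
    for line in file:
        # same if/else logic as above: try the test pattern first, then the task pattern
        match = test_re.findall(line) or task_re.findall(line)
        if match:
            rows.extend(match)

df = pd.DataFrame(rows, columns=['timestamp', 'name', 'type'])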

So how do I get this grouped dataframe full of timestamps into a single chart that shows the start/end timeline?

It seems like I've been running into one problem or another with the regex, dataframes, datetime, etc., but I think I'm getting to a good, clean solution.

patronizing_bofh

2 Answers


My first thought on this question was to use plt.barh, but I have to admit I struggled for a while with the datetime/time handling until the result was as I intended.

Anyway, this is what came out of that idea:

Assuming the following dataframe as the starting point:

df
Out: 
      timestamp            name
0  21:51:38.622            test
1  21:51:38.991  cha.LoginTest4
2  21:51:39.054  cha.LoginTest2
3  21:51:40.068  cha.LoginTest4
4  21:51:40.101  cha.LoginTest2
5  21:51:40.366  cha.LoginTest1
6  21:51:40.413  cha.LoginTest3
7  21:51:50.435  cha.LoginTest1
8  21:51:50.463  cha.LoginTest3
9  21:51:50.484            test

First, I'd group by name and build a new dataframe that holds each name's start and stop times converted to the matplotlib.dates number format:

grpd = df.groupby('name')
plot_data = pd.DataFrame({'start': dt.date2num(pd.to_datetime(grpd.min().timestamp)),
                          'stop':  dt.date2num(pd.to_datetime(grpd.max().timestamp))},
                         index=grpd.min().index)

Subtract the earliest start time so the chart starts from zero (adding 1 back, because that's where matplotlib.dates numbering starts):

plot_data -= plot_data.start.min() - 1
plot_data['duration'] = plot_data.stop - plot_data.start

Based on this dataframe, plotting a horizontal bar chart over time is easy: barh takes the start times as the left edges and the durations as the bar widths:

fig, ax = plt.subplots(figsize=(8,4))
ax.xaxis_date()
ax.barh(plot_data.index, plot_data.duration, left=plot_data.start, height=.4)
plt.tight_layout()
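A small variant, not part of the answer above and assuming the plot_data frame just built: matplotlib date numbers are in days, so multiplying by 86400 converts them to seconds, which can be easier to read on the x axis than raw date numbers:

elapsed = (plot_data.start - plot_data.start.min()) * 86400   # days -> seconds
duration_s = plot_data.duration * 86400

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(plot_data.index, duration_s, left=elapsed, height=.4)  # plain numeric axis, no xaxis_date() here
ax.set_xlabel('seconds since first event')
plt.tight_layout()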

(Resulting plot: a horizontal bar chart with one bar per test/task spanning its start to stop time, names along the y axis.)

SpghttCd

Can't test now, sorry, but this (or something close) should help: create a single figure before the plotting loop, then plot each group's data onto the same axes.

fig, ax = plt.subplots()
ax.xaxis_date()
for name, group in df.groupby('name', sort=False):

    group.amin = group['timestamp'].iloc[0] # assume sorted order
    group.amax = group['timestamp'].iloc[1]

    ax.hlines(group.index, dt.date2num(group.amin), dt.date2num(group.amax))

plt.show()
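One caveat: with the dataframe from the question every row has index 0, so group.index puts all the lines at the same height. Using a per-group counter for the y value, as the solved source above does, keeps them apart; a sketch of that variation:

fig, ax = plt.subplots()
ax.xaxis_date()
for i, (name, group) in enumerate(df.groupby('name', sort=False)):
    start = group['timestamp'].iloc[0]   # assume sorted order
    end = group['timestamp'].iloc[-1]
    ax.hlines(i, dt.date2num(start), dt.date2num(end))
    ax.text(dt.date2num(start), i + 0.05, name)  # label each line with its name

plt.show()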
Peter Leimbigler