Gantt Chart for USGS Hydrology Data with Python?

Question

I have a compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has columns as site names and a date index as rows.

Here is a sample of my data.

The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?

First, I have a plot that shows where the missing data is.

import missingno as msno
msno.bar(dfp)

Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt

df=[('RIO GRANDE AT EMBUDO, NM','2015-7-22','2015-12-7'),
('RIO GRANDE AT EMBUDO, NM','2016-1-22','2016-8-5'),
('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2014-12-10','2015-12-14'),
('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2017-1-10','2017-11-25'),
('RIO GRANDE AT OTOWI BRIDGE, NM','2015-8-17','2017-8-21'),
('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM','2015-9-1','2016-6-1'),
('RIO GRANDE NEAR CERRO, NM','2016-1-2','2016-3-15'),
] 
df=pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))

How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.

Please note: values of zero are considered valid data points.

Thank you in advance for your feedback!

Can you edit your answer to include what you've tried so far, the output you got and describe how that different from what you want? There are some tips here on how to ask a good question. [Link](https://stackoverflow.com/help/how-to-ask) — Jason, Feb 06 '20 at 21:24
Good point, I should have included what I have done so far. I have now updated the question to include what I have done so far. Thank you! — Jeff Coldplume, Feb 07 '20 at 16:30

score 1 · Answer 1 · answered Feb 07 '20 at 20:29

Here's an approach that you could use, it's a bit hacky so perhaps some else will produce a better solution but it should produce your desired output. First use pd.where to replace non NaN values with an integer which will later determine the position of the lines on y-axis later, I do this row by row so that all data which belongs together will be at the same height. If you want to increase the spacing between the lines of the gantt chart you can add a number to i, I've provided an example in the comments in the code block below.

The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.

This approach returns matplotlib.pyplot.axes and matplotlib.pyplot.Figure object, so you can adjust the asthetics of the chart to suit your purposes (i.e. change the thickness of the lines, colours etc.). Link to docs.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)

new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines add a number to i, eg. below:
# [df[s].where(df[s].isna(), i+3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)

### Plotting ###

fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
x_tick_location = new_df.iloc[:, 0].values # For y tick positions
ax.set_yticks(x_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names

cfort · Accepted Answer · 2020-02-13T01:15:27.750

Find date ranges of non-null data

2020-02-12 Edit to clarify logic in loop

df = pd.read_excel('Downloads/output.xlsx', index_col='date')

Make sure the dates are in order:

df.sort_index(inplace=True)

Loop thru the data and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge and collect them all in a list:

# Looping feels like defeat. However, I'm not clever enough to avoid it 
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name

    # Start of good data block defined by a number preceeded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index

    # End of good data block defined by a number followed by a Nan
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index

    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))

good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])

Plotting

Nothing new here. Copied pretty much straight from your question:

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(good_ranges['gauge'], 
                dt.date2num(good_ranges['start']), 
                dt.date2num(good_ranges['end']))
fig.tight_layout()

I apologize, I have been out of town for a bit. Thank you for the response; I think this makes sense to me. I had to go through your code little by little to make sure I understood how things were working. One question just to clarify: essentially adding an extra date (with an associated nan value) to each end makes it so that when we find where the data runs from valid into a nan value, we aren't losing the valid data points once we use the .shift() call? — Jeff Coldplume, Feb 12 '20 at 22:23
Now that you point that out, the extending the index is not needed here since `shift` will provide `NaN`s. I've edited my answer. That index extension stuff was left over from solving the inverse problem -- find when the gauges were off line, the NaN groups. — cfort, Feb 13 '20 at 01:19
Great, that makes more sense to me. Thank you for the clarification! — Jeff Coldplume, Feb 13 '20 at 04:32

Gantt Chart for USGS Hydrology Data with Python?

2 Answers2

Linked