Python DataFrame - plot a bar chart for data frame with grouped-by columns (at least two columns)

Question

I've been struggling to recreate this Excel graph in python using matlplotlib:

The data is in a dataframe; I'm trying to automate the process of generating this graph.

I've tried unstacking my dataframe, subplotting, but I haven't managed to create the "Zone" index which is so elegant in Excel. I have successfully managed to plot the graph without this "Zone" index, but that's not really what I want to do.

Here is my code:

data = pd.DataFrame(
    {
        'Factory Zone':
        ["AMERICAS","APAC","APAC","APAC","APAC","APAC","EMEA","EMEA","EMEA","EMEA"],
        'Factory Name':
        ["Chocolate Factory","Crayon Factory","Jobs Ur Us", "Gibberish US","Lil Grey", "Toys R Us","Food Inc.",
        "Pet Shop", "Bonbon Factory","Carrefour"],
        'Production Day 1':
        [24,1,9,29,92,79,4,90,42,35],
        'Production Day 2':
        [2,43,17,5,31,89,44,49,34,84]
    })
df = pd.DataFrame(data)
print(df)
# Without FactoryZone, it works:
df = df.drop(['Factory Zone'], axis=1)
image = df.plot(kind="bar")

And the data looks like this:

  Unnamed: 0 FactoryZone       Factory Name  Production Day 1  Production Day 2
0           1    AMERICAS  Chocolate Factory                24                43
1           2    AMERICAS     Crayon Factory                 1                17
2           3        EMEA           Pet Shop                 9                 5
3           4        EMEA     Bonbon Factory                29                31
4           5        APAC           Lil Grey                92                89
5           6    AMERICAS         Jobs Ur Us                79                44
6           7        APAC          Toys R Us                 4                49
7           8        EMEA          Carrefour                90                34
8           9    AMERICAS       Gibberish US                42                84
9          10        APAC          Food Inc.                35                62

To match the data table and the Excel graph, the dictionary for the dataframe should be edited like this: `'Factory Zone': ['AMERICAS', 'AMERICAS', 'AMERICAS', 'AMERICAS', 'APAC', 'APAC', 'APAC', 'EMEA', 'EMEA', 'EMEA']` — Patrick FitzGerald, Feb 21 '21 at 12:29

score 6 · Accepted Answer · answered Feb 09 '21 at 14:50

You can create this plot by first creating a MultiIndex for your hierarchical dataset where level 0 is the Factory Zone and level 1 is the Factory Name:

import numpy as np                 # v 1.19.2
import pandas as pd                # v 1.1.3
import matplotlib.pyplot as plt    # v 3.3.2

df = pd.DataFrame(
    {'Factory Zone': ['AMERICAS', 'AMERICAS', 'AMERICAS', 'AMERICAS', 'APAC',
                      'APAC', 'APAC', 'EMEA', 'EMEA', 'EMEA'],
     'Factory Name': ['Chocolate Factory', 'Crayon Factory', 'Jobs Ur Us',
                      'Gibberish US', 'Lil Grey', 'Toys R Us', 'Food Inc.',
                      'Pet Shop', 'Bonbon Factory','Carrefour'],
     'Production Day 1': [24,1,9,29,92,79,4,90,42,35],
     'Production Day 2': [2,43,17,5,31,89,44,49,34,84]
    })

df.set_index(['Factory Zone', 'Factory Name'], inplace=True)

df

#                                   Production Day 1  Production Day 2
#  Factory Zone       Factory Name      
#      AMERICAS  Chocolate Factory                24                 2
#                   Crayon Factory                 1                43
#                       Jobs Ur Us                 9                17
#                     Gibberish US                29                 5
#          APAC           Lil Grey                92                31
#                        Toys R Us                79                89
#                        Food Inc.                 4                44
#         EMEA            Pet Shop                90                49
#                   Bonbon Factory                42                34
#                        Carrefour                35                84

Like Quang Hoang has proposed, you can create a subplot for each zone and stick them together. The width of each subplot must be corrected according to the number of factories by using the width_ratios argument in the gridspec_kw dictionary so that all the columns have the same width. Then there are limitless formatting choices to make.

In the following example, I choose to show separation lines only between zones by using the minor tick marks for this purpose. Also, because the figure width is limited here to 10 inches only, I rewrite the longer labels on two lines.

# Create figure with a subplot for each factory zone with a relative width
# proportionate to the number of factories
zones = df.index.levels[0]
nplots = zones.size
plots_width_ratios = [df.xs(zone).index.size for zone in zones]
fig, axes = plt.subplots(nrows=1, ncols=nplots, sharey=True, figsize=(10, 4),
                         gridspec_kw = dict(width_ratios=plots_width_ratios, wspace=0))

# Loop through array of axes to create grouped bar chart for each factory zone
alpha = 0.3 # used for grid lines, bottom spine and separation lines between zones
for zone, ax in zip(zones, axes):
    # Create bar chart with grid lines and no spines except bottom one
    df.xs(zone).plot.bar(ax=ax, legend=None, zorder=2)
    ax.grid(axis='y', zorder=1, color='black', alpha=alpha)
    for spine in ['top', 'left', 'right']:
        ax.spines[spine].set_visible(False)
    ax.spines['bottom'].set_alpha(alpha)
    
    # Set and place x labels for factory zones
    ax.set_xlabel(zone)
    ax.xaxis.set_label_coords(x=0.5, y=-0.2)
    
    # Format major tick labels for factory names: note that because this figure is
    # only about 10 inches wide, I choose to rewrite the long names on two lines.
    ticklabels = [name.replace(' ', '\n') if len(name) > 10 else name
                  for name in df.xs(zone).index]
    ax.set_xticklabels(ticklabels, rotation=0, ha='center')
    ax.tick_params(axis='both', length=0, pad=7)
    
    # Set and format minor tick marks for separation lines between zones: note
    # that except for the first subplot, only the right tick mark is drawn to avoid
    # duplicate overlapping lines so that when an alpha different from 1 is chosen
    # (like in this example) all the lines look the same
    if ax.is_first_col():
        ax.set_xticks([*ax.get_xlim()], minor=True)
    else:
        ax.set_xticks([ax.get_xlim()[1]], minor=True)
    ax.tick_params(which='minor', length=55, width=0.8, color=[0, 0, 0, alpha])

# Add legend using the labels and handles from the last subplot
fig.legend(*ax.get_legend_handles_labels(), frameon=False, loc=(0.08, 0.77))

fig.suptitle('Production Quantity by Zone and Factory on both days', y=1.02, size=14);

References: the answer by Quang Hoang, this answer by gyx-hh

You can find an alternative solution in [this answer by Stein](https://stackoverflow.com/a/39502106/14148248). — Patrick FitzGerald, Feb 09 '21 at 14:54

score 1 · Answer 2 · answered Feb 02 '21 at 19:25

An idea that gives a close plot is to plot to each Factory Zone in a subplot that are place next to each other:

# setting up the subplots
fig, axes = plt.subplots(1, len(df['Factory Zone'].unique()), 
                         figsize=(12,4),
                         sharex=True, sharey=True, 
                         gridspec_kw={'wspace':0},
                         subplot_kw={'frameon':False})

# use groupby to loop through the `Factory Zone`
for (k,d), ax in zip(df.groupby('Factory Zone'), axes):

    # plot the data into subplot
    d.plot.bar(x='Factory Name', ax=ax)
    
    # set label to the `Factory Zone`
    ax.set_xlabel(k)
    
    # remove the extra legend in each subplot
    legend = ax.legend()
    handlers = ax.get_legend_handles_labels()
    ax.legend().remove()
    ax.grid(True, axis='y')

# reinstall the last legend
ax.legend(*handlers)

Output:

Thanks Hoang! Would you have any suggestions on how to make duplicates disappear? For instance Pet Shop and Bonbon Factory have factories only in EMEA but they appear in Americas and APAC on this graph — Clara Loizeau, Feb 03 '21 at 10:30

score 1 · Answer 3 · answered Jun 23 '22 at 19:21

1

The solution offered by Patrick FitzGerald has a single line that was deprecated in Matplotlib 3.4 and will be removed in 2 minor releases. (I'd put this as a comment rather than an answer, but I don't have enough reputation yet!)

Change:

if ax.is_first_col():

to

if ax.get_subplotspec().is_first_col():

answered Jun 23 '22 at 19:21

Katrina Koehler

11
1

This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/32092119) – BrokenBenchmark Jun 25 '22 at 15:29

Python DataFrame - plot a bar chart for data frame with grouped-by columns (at least two columns)

3 Answers3

Linked