Note: A full reproduction notebook for this question can be found on GitHub.

I have a data set with a distribution of HTTP response codes that I would like to group by class. Sample data can be generated like so:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

mock_http_response_data = pd.DataFrame({
    'response_code':np.repeat([200, 201, 202, 204, 302, 304, 400, 404, 500, 502], 250 ),
})

I have added a column to the data, called 'response_class', that is derived from the 'response_code' column. It contains the label corresponding to the class of each response:

  • 2xx: success
  • 3xx: warning
  • 4xx: client error
  • 5xx: server error

The function to determine the response class is:

def determine_response_class(row):    
    response_code = row['response_code']

    if response_code >= 200 and response_code < 300:
        return 'success'
    elif response_code >= 300 and response_code < 400:
        return 'warning'
    elif response_code >= 400 and response_code < 500:
        return 'client_error'
    elif response_code >= 500 and response_code < 600:
        return 'server_error'
    else:
        return 'unknown'

And the column is added like so:

# Add 'Response class' column to API Logs, where response class is determined by HTTP status code
mock_http_response_data['response_class'] = mock_http_response_data.apply(determine_response_class, axis='columns')
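
As an aside, the same column could probably be built without a row-wise `apply`. The following is only a minimal sketch, assuming the codes are numeric; the lookup table `CLASS_BY_HUNDREDS` and the helper `response_class_series` are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: hundreds digit of the status code -> class label.
# The labels are the same ones returned by determine_response_class above.
CLASS_BY_HUNDREDS = {
    2: 'success',
    3: 'warning',
    4: 'client_error',
    5: 'server_error',
}


def response_class_series(codes: pd.Series) -> pd.Series:
    """Map each status code to its class label; anything else becomes 'unknown'."""
    # Integer division by 100 reduces e.g. 204 to 2 and 502 to 5.
    # Codes whose hundreds digit is not in the table map to NaN, then 'unknown'.
    return (codes // 100).map(CLASS_BY_HUNDREDS).fillna('unknown')


mock = pd.DataFrame({'response_code': np.repeat([200, 204, 302, 404, 502, 99], 2)})
mock['response_class'] = response_class_series(mock['response_code'])
print(mock.drop_duplicates().to_string(index=False))
```

For the small mock data set the difference is negligible; the vectorized form mainly matters for large log files.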

The 'response_code' (HTTP status code) data plots correctly with a basic countplot:

sns.countplot(
    x='response_code',
    data=mock_http_response_data,
    color='teal',
    saturation=0.7)

[Figure: uniform status code distribution]

When I try to create a FacetGrid of countplots, the charts seem to work, but the labels are incorrect:

grid = sns.FacetGrid(mock_http_response_data, col='response_class')

grid.map(sns.countplot, 'response_code')

I would expect that the FacetGrid of countplots would have the following x-axis labels:

  • 200
  • 201
  • 202
  • 204
  • 302
  • 304
  • 400
  • 404
  • 500
  • 502

How can I create a FacetGrid of countplots so that the labels are correct and the faceted data are sorted from high to low (e.g. the 'success' class column)?

Brylie Christopher Oxley
  • What about creating a [mcve] of the issue? How else could anyone know if the labels are incorrect as you claim? – ImportanceOfBeingErnest Sep 22 '17 at 10:44
  • The pictures in the question depict the data. The first chart shows the overall distribution of the data with correct (x axis) labels, the second chart simply slices the data into four segments (2xx, 3xx, 4xx, 5xx). If you compare the charts vertically, you will notice that they have a strong correspondence, but the second picture has incorrect labels. – Brylie Christopher Oxley Sep 22 '17 at 10:50
  • I added about as much detail to the original question as I can, without publishing the actual data. – Brylie Christopher Oxley Sep 22 '17 at 10:59
  • Well, maybe you did not get my point. You are basically asking someone to create some dataframe to reproduce the issue, which might be possible, but is a waste of time. If instead you create some data yourself and provide a [mcve], people are much more inclined to help you. It is of course your choice in the end. – ImportanceOfBeingErnest Sep 22 '17 at 11:01
  • I added a full reproduction notebook for this question, including data: https://github.com/brylie/jupyter_http_status_code_visualization/blob/master/http_status_code_visualization.ipynb – Brylie Christopher Oxley Sep 22 '17 at 11:44
  • This **is not** a [mcve]. It requires some file `'http_response_status_data.csv'` to be available. But even if we have access to it, this is not the way to ask a question here. Nobody cares about your actual data anyway. What is needed is a dataset that reproduces the issue. You would want to create such a dataset within the code (`df = pd.DataFrame(....)`) such that people are able to reproduce the issue and help you find a solution or provide one as an answer. Again, feel free to ignore that advice, but be aware that you won't get help soon in that case. – ImportanceOfBeingErnest Sep 22 '17 at 11:49
  • The data file is in the GitHub repository. https://github.com/apinf/jupyter-api-log-analysis/blob/master/api_logs.csv – Brylie Christopher Oxley Sep 22 '17 at 11:50
  • The question and issue are fully reproduced in the GitHub repository: https://github.com/apinf/jupyter-api-log-analysis – Brylie Christopher Oxley Sep 22 '17 at 11:51
  • While you are waiting for someone to dive into your case, you might want to read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). It may make you change your mind. – ImportanceOfBeingErnest Sep 22 '17 at 12:00
  • Change my mind about what? – Brylie Christopher Oxley Sep 22 '17 at 12:01
  • I have worked to provide a reproducible example, even including the data. I want the DataFrame to be correct, so opted to include the actual CSV in the reproduction repository. Thank you for the links and guidance. – Brylie Christopher Oxley Sep 22 '17 at 12:03
  • I am also trying to provide a link to the notebook on MyBinder, but am getting some unrelated errors there. – Brylie Christopher Oxley Sep 22 '17 at 12:06
  • Please remove the downvote from the question. I have shown good faith in making the changes you requested and by providing a full reproduction. – Brylie Christopher Oxley Sep 22 '17 at 12:08
  • I will not change my voting principles because of "good faith". I've given enough hints on how to make this question useful. If after some time a question is still [worthy for close-voting as off-topic](https://i.stack.imgur.com/KpNP0.png), I will in general down-vote it. – ImportanceOfBeingErnest Sep 22 '17 at 12:15
  • I have added a full stand-alone reproduction in the issue itself, including screenshots. – Brylie Christopher Oxley Sep 22 '17 at 12:47
  • Ok, what is the issue about sorting now? Should the codes be sorted within each class? – ImportanceOfBeingErnest Sep 22 '17 at 13:05
  • Well, sorting would be nice, so it is easier to interpret. At the least, however, the labels should be correct. Right now, it seems to be using labels from the fourth column facet on all four facets. – Brylie Christopher Oxley Sep 22 '17 at 13:06
  • What shall be sorted by what? – ImportanceOfBeingErnest Sep 22 '17 at 13:14
  • Each column in the grid (e.g. 'success', 'warning') would have its row values (e.g. count of 200, count of 201) sorted from high to low. In the contrived example, the data are in a uniform distribution, but the natural data, as in my reproduction repository and the original version of this question, have varying counts. – Brylie Christopher Oxley Sep 22 '17 at 13:18
  • Ok, I added that option to my answer below. – ImportanceOfBeingErnest Sep 22 '17 at 14:01

1 Answer

The problem of wrong labels appears because by default, the x axes of the subplots are shared, hence all plots will have the same x-axis as the last plot.

You can use the sharex=False argument in order to prevent sharing of the axes:

grid = sns.FacetGrid(df, col='class', sharex=False)

import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns

codes = [200, 201, 202, 204, 302, 304, 400, 404, 500, 502]
p = np.random.rand(len(codes))
p = p/p.sum()
df = pd.DataFrame({ 'code': np.random.choice(codes, size=300, p=p) })

def determine_response_class(row):    
    response_code = row['code']

    if response_code >= 200 and response_code < 300:
        return 'success'
    elif response_code >= 300 and response_code < 400:
        return 'warning'
    elif response_code >= 400 and response_code < 500:
        return 'client_error'
    elif response_code >= 500 and response_code < 600:
        return 'server_error'
    else:
        return 'unknown'

df['class'] = df.apply(determine_response_class, axis='columns')

grid = sns.FacetGrid(df, col='class', sharex=False)

grid.map(sns.countplot, 'code')

plt.show()
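
To see why a shared x-axis mislabels the facets, it may help to list which code sits at each bar position in each facet. This is a rough sketch, assuming (as I understand seaborn's categorical plots) that the bars are drawn at integer positions 0, 1, 2, ... in sorted order of the numeric codes; the helper `tick_labels_per_facet` and the sample data are made up for illustration.

```python
import pandas as pd


def tick_labels_per_facet(df, class_col='class', code_col='code'):
    """Return {class label: codes in the order they would occupy positions 0, 1, 2, ...}."""
    return {
        label: sorted(group[code_col].unique().tolist())
        for label, group in df.groupby(class_col)
    }


# Tiny hand-made sample, not the data from the question.
sample = pd.DataFrame({
    'code':  [200, 201, 302, 304, 500, 502],
    'class': ['success', 'success', 'warning', 'warning',
              'server_error', 'server_error'],
})

for label, codes in tick_labels_per_facet(sample).items():
    # Position 0 means a different code in every facet, so a single shared
    # set of tick labels can only be right for one of them.
    print(label, list(enumerate(codes)))
```

With sharex=False each facet keeps its own tick labels, so position 0 is labelled with that facet's own first code.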

Sorting is now a chicken-and-egg problem: to set the order of the bars you need to know the count for each code, but the counts are only determined as part of the plotting. At this point it is probably wise to stick to a clear separation between data generation, analysis and visualization. The following shows a sorted graph, without the use of FacetGrid, by first counting and sorting the values in the dataframe.

import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns

codes = [200, 201, 202, 204, 302, 304, 400, 404, 500, 502]
p = np.random.rand(len(codes))
p = p/p.sum()
df = pd.DataFrame({ 'code': np.random.choice(codes, size=300, p=p) })

def determine_response_class(row):    
    response_code = row['code']

    if response_code >= 200 and response_code < 300:
        return 'success'
    elif response_code >= 300 and response_code < 400:
        return 'warning'
    elif response_code >= 400 and response_code < 500:
        return 'client_error'
    elif response_code >= 500 and response_code < 600:
        return 'server_error'
    else:
        return 'unknown'

df['class'] = df.apply(determine_response_class, axis='columns')

# Count the occurrences of each (code, class) pair, then sort by count, high to low
df2 = df.groupby(["code","class"]).size().reset_index(name="count") \
        .sort_values(by="count", ascending=False).reset_index(drop=True)

fig, axes = plt.subplots(ncols=4, sharey=True, figsize=(8,3))
for ax,(n, group) in zip(axes, df2.groupby("class")):
    sns.barplot(x="code",y="count", data=group, ax=ax, color="C0", order=group["code"])
    ax.set_title(n)

plt.tight_layout()
plt.show()
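
If you would rather keep the counting step separate and reusable, the per-class bar order can be computed with pandas alone. This is a minimal sketch, not a definitive implementation; `codes_by_frequency` is a hypothetical helper and the sample data is hand-made so that no two codes in a class have the same count.

```python
import pandas as pd


def codes_by_frequency(df, class_col='class', code_col='code'):
    """Return {class label: codes sorted from most to least frequent}."""
    return {
        # value_counts() sorts by count in descending order by default.
        label: group[code_col].value_counts().index.tolist()
        for label, group in df.groupby(class_col)
    }


# Tiny hand-made sample, not the data from the question.
sample = pd.DataFrame({
    'code':  [200, 201, 201, 201, 204, 204, 404, 404, 400, 500],
    'class': ['success'] * 6 + ['client_error'] * 3 + ['server_error'],
})

order_by_class = codes_by_frequency(sample)
print(order_by_class)
```

Each list could then be passed as the `order` argument of `sns.countplot` or `sns.barplot` when drawing the corresponding subplot.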

ImportanceOfBeingErnest