Seaborn showing values not found in Pandas columns

Question

Original dataframe:

dp.head(10)

Creating new dataframe using recommended selection method:

dtest = pd.DataFrame(dp[dp['numdept'].isin([3,6,8,10])]).dropna()
dtest.reset_index(drop =True, inplace = True)
dtest.head(10)

Testing to make sure that only the values in [3,6,8,10] are in dtest['numdept']:

print "numdept is 5:", dtest[dtest["numdept"].isin ([5])]
print "set of distinct values in the numdept column:", sorted(set(dtest['numdept'].tolist()))

>> numdept is 5: Empty DataFrame
>> Columns: [numgrade, numyear, numdept]
>> Index: []
>> set of distinct values in the numdept column: [3, 6, 8, 10]

Plotting:

plt.figure(figsize=(16, 8))
sb.boxplot(x="numyear", y="numgrade", hue="numdept", data=dtest)

Question: Why are the "numdept" categories in the plot legend showing values other than 3, 6, 8, 10?

Problem surfaced in an ipython notebook, but recurs even when I carry the code to a regular environment. Also tried to avoid seaborn related issues by using the suggestion here, to no avail.

Using Canopy 1.7.4.3348, jupyter 1.0.0-15, pandas 0.19.0-1 matplotlib 1.5.1-9 and seaborn 0.7.0-6

EDIT: On an impulse, inserted the following before the plotting code:

grouped = dtest.groupby(['numdept', 'numyear'])
grouped.mean()

The output has numdept values that should not exist in dtest.

Does this make it a pandas bug?

This looks as expected to me. What do you feel is wrong exactly? — Little Bobby Tables, Dec 18 '16 at 00:32
@josh Shouldn't the plot legend only show 3, 6, 8, 10? — user2738815, Dec 18 '16 at 00:43
Using the following gets me the 4 valued legend as you expect: `dp = pd.concat([pd.DataFrame(np.random.randint(1, 4, [100, 1])), pd.DataFrame(np.random.randint(1, 14, [100, 1])), pd.DataFrame([3]*20 + [6]*20 + [8]*20 + [10]*20 + [11]*20)], axis=1)`. Apologies, it is not very neat. Not sure why yours is not showing just 4. — Little Bobby Tables, Dec 18 '16 at 00:58
How is the original dataframe generated? Are any of the columns categorical? — BrenBarn, Dec 18 '16 at 02:21
I cannot reproduce the problem using matplotlib 1..3.1, pandas 0.19.1 and seaborn 0.7.1 with the code to generate the data shown in the answer by @josh. In fact, with the untouched data I get an extra entry in the legend which I don't get in case only a limited ([3,6,8,10]) set of values is selected from the "numdept" column, i.e. in the last case I get only entries in the legend for [3,6,8,10]. And this is true even without specifying the parameter hue_order in the call to sns.boxplot(). — fedepad, Dec 18 '16 at 02:31
Also, it looks like the code in seaborn that fills the entries in the legend for a boxplot hasn't really changed between 0.7.0 and 0.7.1. The list of unique values that then is "passed" to produce the legend is generated in the same way (same code). — fedepad, Dec 18 '16 at 02:39
Indeed, the code that generates this unique list ([3,6,8,10] in your case) handles the case in which no hue_order is passed (hue_order=None): it sorts the values and keeps only the unique ones. — fedepad, Dec 18 '16 at 02:47
@BrenBarn Dataframe `dp` was created by reading from a csv file; `numyear` and `numdept` were added by mapping from other columns and typed as categorical using pandas' `.astype('category')` method. But unless there is a bug in pandas, how would this matter? The output of `dtest` is showing only four of these categories in the `numdept` column; the others are not represented in `dtest` (supposedly). — user2738815, Dec 18 '16 at 03:22
What values are present is not the same as what category labels exist. See [the documentation](http://pandas.pydata.org/pandas-docs/stable/categorical.html#working-with-categories). It may be considered a bug nonetheless, but you should try removing unused categories as described in the docs. — BrenBarn, Dec 18 '16 at 03:32
@BrenBarn Inserting `dtest['numdept'] = dtest['numdept'].cat.remove_categories([1,2,4,5,7,9,11])` right after the selection "solved" the problem. I also think this is a bug. Why don't you post your answer and I will select it. — user2738815, Dec 18 '16 at 04:05

Little Bobby Tables · Answer 1 · 2016-12-18T01:51:49.097

4

I am not certain why this is happening, but there is an easy way to get the [3, 6, 8, 10] legend you want.

import numpy as np
import pandas as pd
import seaborn as sns

# Create mock data
dp = pd.concat([pd.DataFrame(np.random.randint(1, 4, [100, 1])),
                pd.DataFrame(np.random.randint(1, 14, [100, 1])),
                pd.DataFrame([3.0]*20 + [6.0]*20 + [8.0]*20 + [10.0]*20 + [11.0]*20)], axis=1)
dp.columns = ["numyear", "numgrade", "numdept"]

dtest = pd.DataFrame(dp[dp['numdept'].isin([3,6,8,10])]).dropna()
dtest.reset_index(drop=True, inplace=True)

sns.boxplot(x="numyear", y="numgrade", hue="numdept", data=dtest,
            hue_order=[10, 3 , 8, 6])

Here I have added a hue_order and specified both the order (I chose non-numeric order to emphasise this) and the exact values I'd like to see. If I had specified [1, 2, 3, 6, 8, 10] instead, those values would appear in the legend.

Finally, you could generalise this nicely using the following,

# sorted() returns a new list; ndarray.sort() sorts in place and returns None
sns.boxplot(x="numyear", y="numgrade", hue="numdept", data=dtest,
            hue_order=sorted(dtest.numdept.unique()), width=0.2)

edited Dec 18 '16 at 01:51

answered Dec 18 '16 at 01:19


Nice, can you explain what you mean by "I chose non-numeric to emphasise this"? It seems the argument to `hue_order` is numeric, not non-numeric. – ImportanceOfBeingErnest Dec 18 '16 at 01:44
Non-numeric order. I have added "order" to make it make more sense :) – Little Bobby Tables Dec 18 '16 at 01:52
Thank you much for your help-upvoting. However, I would also like an answer to the original question. Knowing why this is happening may affect whether and how I will use any one of these packages in the future. – user2738815 Dec 18 '16 at 01:58
Of course. I shall have a further look tomorrow and see if I might add to my question then. – Little Bobby Tables Dec 18 '16 at 02:06

score 4 · Accepted Answer · answered Dec 18 '16 at 04:26

You are using a categorical variable. It appears the legend is based on the categories in the categorical variable, not the values that are actually present. A categorical variable may represent categories that don't actually occur in the data, and these categories are still shown in the legend.

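As suggested in the documentation, you can remove the empty categories with dtest['numdept'] = dtest['numdept'].cat.remove_unused_categories(). The method returns a new Series, so the result has to be assigned back to the column.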
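
To make the difference between values and categories concrete, here is a minimal sketch. The small `dp` frame is made-up stand-in data (the real one came from a csv file), and the names `values_present`, `categories_before` and `categories_after` are only for illustration. It assumes `numdept` was declared categorical with levels 1 to 11, as described in the comments.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: numdept is categorical with levels 1..11.
dp = pd.DataFrame({
    "numgrade": [70, 80, 90, 60, 75, 85],
    "numdept": pd.Categorical([3, 5, 6, 8, 10, 11], categories=list(range(1, 12))),
})

# Selecting rows keeps every declared category, even those with no rows left.
dtest = dp[dp["numdept"].isin([3, 6, 8, 10])].reset_index(drop=True)
values_present = sorted(dtest["numdept"].unique().tolist())
categories_before = dtest["numdept"].cat.categories.tolist()

# Drop the levels that no longer occur; a new Series is returned, so assign it back.
dtest["numdept"] = dtest["numdept"].cat.remove_unused_categories()
categories_after = dtest["numdept"].cat.categories.tolist()

print(values_present)     # values that actually occur in the column
print(categories_before)  # declared categories before the cleanup
print(categories_after)   # declared categories after the cleanup
```

Before the cleanup the column still carries all eleven categories although only four values occur, which is consistent with the extra legend entries in the question. After the cleanup the categories match the values.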
