
After grouping by column a and restricting the histogram to columns f and g, column a still shows up in green. Is there a way to remove it without dropping down to matplotlib or writing a for loop?

axes = dfs.hist(column=['f', 'g'], by='a', layout=(1, 3), legend=True, bins=np.linspace(0, 8, 10),
            sharex=True, sharey=True)
Simon
  • It would certainly have helped to provide sample data... see [reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – BigBen Apr 27 '21 at 20:17
  • @BigBen Thanks for the reminder. Indeed, I should have included example data. Nice of ALollz to include it in the answer. – Simon Apr 27 '21 at 20:58

1 Answer

This is clearly a bug in pandas. The problem seems to arise when the `by` column has a numeric dtype: pandas probably subsets the DataFrame to the labels in `column` and `by` and then plots that subset, so a numeric `by` column gets drawn as an extra series alongside the requested ones.

You can either create non-numeric labels for the column that defines your 'by', or if you don't want to change your data, it suffices to re-assign the type to object just before the plot.

Sample Data

import pandas as pd
import numpy as np

df = pd.DataFrame({'length': np.random.normal(0, 1, 1000),
                   'width': np.random.normal(0, 1, 1000),
                   'a': np.random.randint(0, 2, 1000)})

# Problem with a numeric dtype for `by` column
df.hist(column=['length', 'width'], by='a', figsize=(4, 2))


# Works fine when column type is object
(df.assign(a=df['a'].astype('object'))
   .hist(column=['length', 'width'], by='a' , figsize=(4, 2)))
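
The other option mentioned above, giving the `by` column non-numeric labels, might look like the sketch below. The helper names `with_string_labels` and `plot_grouped_hist` and the "group " prefix are made up for illustration. Whether this avoids the extra series depends on the diagnosis above being right, so treat it as a workaround to try rather than a guaranteed fix.

```python
import numpy as np
import pandas as pd


def with_string_labels(df, by, prefix="group "):
    """Return a copy of df whose `by` column holds string labels.

    Hypothetical helper: the name and the prefix are illustrative only.
    """
    return df.assign(**{by: prefix + df[by].astype(str)})


def plot_grouped_hist(df, columns, by):
    """Same call as in the answer, applied to the relabelled frame."""
    return df.hist(column=columns, by=by, figsize=(4, 2))


rng = np.random.default_rng(0)
df = pd.DataFrame({'length': rng.normal(0, 1, 1000),
                   'width': rng.normal(0, 1, 1000),
                   'a': rng.integers(0, 2, 1000)})

# 'a' now holds strings such as "group 0" and "group 1" instead of integers;
# the original df is left untouched.
labelled = with_string_labels(df, 'a')
```

Calling `plot_grouped_hist(labelled, ['length', 'width'], 'a')` then draws the grouped histograms. A side effect is that the subplot titles read "group 0" and "group 1" rather than bare numbers.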


ALollz
  • Thanks for spotting it. Changing the data type is a neat way. Will open an issue on pandas github. – Simon Apr 27 '21 at 21:02