Context
I'm trying to plot the value counts for each column of a dataframe. I can't share the dataset I'm using because it's work-related, so I've used another dataset below.

Blocker
There are 3 main issues:

  • This line "plt.xticks(np.arange(min(df_num[c]),max(df_num[c])+1, aaa));" causes a
    "ValueError: arange: cannot compute length".
  • The xticks overlap
  • The xticks at times aren't at the frequency specified below
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# load dataset
df = sns.load_dataset('mpg')
# subset dataset
df_num = df.select_dtypes(['int64', 'float64'])

# Loop over columns - plots
for c in df_num.columns:
    fig = plt.figure(figsize=[10, 5])
    n_unique = df_num[c].nunique()
    bins1 = n_unique + 1

    # plot
    ax = df[c].plot(kind='hist', color='orange', bins=bins1, edgecolor='w')

    # dynamic xtick frequency
    if n_unique <= 30:
        aaa = 1
    elif 30 < n_unique <= 50:
        aaa = 3
    elif 50 < n_unique <= 60:
        aaa = 6
    elif 60 < n_unique <= 70:
        aaa = 7
    elif 70 < n_unique <= 80:
        aaa = 8
    elif 80 < n_unique <= 90:
        aaa = 9
    elif 90 < n_unique <= 100:
        aaa = 10
    elif 90 < n_unique <= 100:  # unreachable: same condition as the branch above
        aaa = 20
    else:
        aaa = 40

    # format plot (this is the line that raises the ValueError)
    plt.xticks(np.arange(min(df_num[c]), max(df_num[c]) + 1, aaa))
    ax.set_title(c)

@Cimbali
The ticks are sometimes at a bin edge and other times partway into a bin.
Is it possible to have one or the other consistently?

IOIOIOIOIOIOI
  • Please [do not post images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) of your data or errors. You can include [code that creates a dataframe such as `df.to_dict()` or the output of `print(df)`](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) (include at least the few rows and columns needed to reproduce the example) – Cimbali Aug 15 '21 at 11:53

1 Answer

TL;DR: define histogram bins and ticks based on the range of values and not the number of unique values.


Your histogram code makes some assumptions that might not hold, in particular that the unique values are evenly spaced. If that’s not the case (and in general it isn’t), then the range from min to max has little to do with the number of unique values. This is especially true for floating-point columns, where the number of unique values means very little.

In particular, when you plot a histogram, the bins on the x-axis correspond to ranges of values (left). If you plot bars (right), you get one bar per unique value, but the bars are not positioned according to their values on the x-axis.

Here’s a simple example:

>>> s = pd.DataFrame([1, 1, 2, 5])
>>> s.plot(kind='hist') 
>>> s.value_counts().plot(kind='bar')

[Figures: histogram of the example data (left) and bar plot of its value counts (right)]

You can see there are only 3 unique values, but on the histogram (left) the bars are spread over the whole range from min to max. If you only defined 3 bins, then 1 and 2 would be in the same bar.

The bar plot (right) has as many bars as there are unique values, but then the x-axis is no longer proportional to the values.
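The same point can be checked numerically without plotting. This is a minimal sketch using `np.histogram` on the example values; the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 1, 2, 5])

# Histogram: 3 equal-width bins spanning the range from min to max.
counts, edges = np.histogram(values, bins=3)
print(edges)   # bin boundaries between 1 and 5
print(counts)  # 1 and 2 land in the same (first) bin

# Bar plot input: one count per unique value, whatever the spacing.
per_value = values.value_counts().sort_index()
print(per_value)
```

The histogram counts depend on where the values sit in the range, while `value_counts` only depends on how many distinct values there are.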


So instead, let’s define the number of bars and indexes from the range of values:

>>> df_range = df_num.max() - df_num.min()
>>> df_range
mpg               37.6
cylinders          5.0
displacement     387.0
horsepower       184.0
weight          3527.0
acceleration      16.8
model_year        12.0
dtype: float64
>>> df_bins = df_range.div(10).round().clip(lower=df_range.transform(np.ceil) + 1, upper=50).astype(int)
>>> df_bins
mpg             39
cylinders        6
displacement    50
horsepower      50
weight          50
acceleration    18
model_year      13
dtype: int64

Here’s an example of plotting using these bin counts:

>>> for col, n in df_bins.items():
...   fig = plt.figure(figsize=(10,5))
...   df[col].plot.hist(bins=n, title=col)

You can also define xticks in addition to the bin counts. Again, for histograms you have to take the range into account, not the number of unique values (so you could compute the ticks from the bins too). Note that your rules give some pretty weird results, especially on very wide ranges:

>>> ticks = pd.Series(index=df_range.index, dtype=int)
>>> ticks[df_range <= 30] = 1
>>> ticks[(30 < df_range) & (df_range <= 50)] = 3
>>> ticks[(50 < df_range) & (df_range <= 100)] = np.floor(df_range.div(10)) + 1
>>> ticks[100 < df_range] = 40
>>> for col, n in df_bins.items():
...   fig = plt.figure(figsize=(10,5))
...   df[col].plot.hist(bins=n, title=col, xticks=np.arange(df[col].min(), df[col].max() + 1, ticks[col]))

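The tick code above uses `df[col].min()` and `df[col].max()` rather than the built-in `min(...)` and `max(...)` from the question. One possible cause of the `ValueError: arange: cannot compute length`, assuming the real data contains missing values, is that the built-ins can return NaN while the pandas methods skip NaN by default. A minimal sketch with a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical column whose first entry is missing.
col = pd.Series([np.nan, 46.0, 230.0])

# Built-in min returns NaN here: NaN comes first and every
# comparison against NaN is False, so it is never replaced.
builtin_min = min(col)
# pandas skips NaN by default (skipna=True).
pandas_min = col.min()

try:
    np.arange(builtin_min, max(col) + 1, 40)
    arange_failed = False
except ValueError as err:
    arange_failed = True
    print(err)

# With the pandas min/max the tick positions are well defined.
ticks = np.arange(pandas_min, col.max() + 1, 40)
print(ticks)
```

Whether this is what happens with your data depends on where the missing values are, so treat it as something to check rather than a diagnosis.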

Note that you could also use np.linspace to define the ticks from the min, max, and number (instead of min, max, and interval).
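As a rough sketch of the `np.linspace` variant, with a made-up column and an assumed tick count of 6 (neither comes from the question):

```python
import numpy as np
import pandas as pd

col = pd.Series([9.0, 18.5, 23.0, 46.6])  # hypothetical values
n_ticks = 6                               # assumed number of ticks

# Evenly spaced ticks from min to max, both endpoints included.
ticks = np.linspace(col.min(), col.max(), n_ticks)
print(ticks)
```

The number of ticks is then fixed regardless of the range, which avoids crowded labels on wide ranges, at the cost of tick values that are usually not round numbers.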

Cimbali
  • Cimbali, refer to bottom of posted question. The iteration only produces 1 plot. Any help will be greatly appreciated – IOIOIOIOIOIOI Aug 15 '21 at 18:29
  • @IOIOIOIOIOIOI I did a `plt.show()` at each iteration, but you can also simply put your `fig = plt.figure(figsize= [10,5])` back in. – Cimbali Aug 15 '21 at 18:33
  • Cimbali, many thanks. Last question: is it possible to have the ticks always either at the bin edge or mid-bin? Refer to the bottom of the post for images. – IOIOIOIOIOIOI Aug 15 '21 at 18:51
  • @IOIOIOIOIOIOI you could define the bins with their bounds, starting at min - 0.5 and ending at max + 0.5. E.g. instead of `bins=n`, you could do something like `bins=np.linspace(df[col].min() - .5, df[col].max() + .5, n)` (maybe `n + 1` instead of `n` and also maybe ± `.5 * binsize` instead of `.5`). Try some stuff out and if you can’t figure it out I think that’s a good topic for a new question. – Cimbali Aug 15 '21 at 19:05
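
The suggestion in the last comment could look roughly like this for an integer-valued column with a bin width of 1. The data and the tick step of 2 are assumptions for illustration, and other bin widths would need the ± `.5 * binsize` adjustment mentioned in the comment:

```python
import numpy as np
import pandas as pd

# Hypothetical integer-valued column.
col = pd.Series([70, 70, 71, 73, 75, 75, 82])

lo, hi = col.min(), col.max()

# Bin edges halfway between integers, so every integer is a bin centre.
edges = np.arange(lo - 0.5, hi + 1.5, 1.0)
# Ticks on integers, here every 2 units (assumed step).
ticks = np.arange(lo, hi + 1, 2)

counts, _ = np.histogram(col, bins=edges)
centres = (edges[:-1] + edges[1:]) / 2
print(centres)
print(counts)
```

Passing `bins=edges` and `xticks=ticks` to `col.plot.hist(...)` should then put every tick in the middle of a bar. To have ticks on bin edges instead, the ticks could be taken from `edges` directly.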