5

Suppose I have a Pandas dataframe with discrete values in a column.

import pandas as pd

data = ['A']*2 + ['C']*3 + ['B']* 1
print(data)
# ['A', 'A', 'C', 'C', 'C', 'B']

my_df = pd.DataFrame({'mycolumn': data})
print(my_df)
#   mycolumn
# 0        A
# 1        A
# 2        C
# 3        C
# 4        C
# 5        B

I then create a histogram showing the frequency of those values. I use the Pandas built-in function hist(), which in turn relies on the Matplotlib histogram function.

my_df.mycolumn.hist()

Now, how do I put the labels on the x-axis in a specific order? For example, I want the labels in the order C, A, B, rather than the A, C, B order that the plot currently uses.

Additionally, how do I change the y-axis to be integers rather than floats? The frequency values are discrete counts.

stackoverflowuser2010
  • 3
    You'll want to create a bar chart, not a histogram. – BigBen Jan 13 '21 at 20:02
  • 2
    For your new question: [matplotlib restrict to integer tick locations](https://stackoverflow.com/questions/11258212/python-matplotlib-restrict-to-integer-tick-locations). `from matplotlib.ticker import MaxNLocator`; `plt.gca().yaxis.set_major_locator(MaxNLocator(integer=True))` – JohanC Jan 13 '21 at 20:14
  • 2
    In addition to the answers shown, you can specify `mycolumn` as Categorical with [`pandas.Categorical`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html), where you can set the order. – Trenton McKinney Jan 13 '21 at 20:17
  • 2
    As already stated, this is really a bar plot with `count`, not a histogram. See this [question](https://stackoverflow.com/q/65691436/7758804) for an example. There are two answers: the one I gave shows a distribution (as that OP wanted), and the other shows a count of occurrences, as you are doing. I think the two answers will illustrate the difference for you. – Trenton McKinney Jan 13 '21 at 20:31
  • 1
    The main issue comes down to the fact that your data is categorical rather than numeric. The API for `hist` doesn't effectively `bin` categories, so you just get a bar plot. If you were to map all your letters to numbers, for example, you would see how the `hist` API bins groups of numbers to show a distribution. – Trenton McKinney Jan 13 '21 at 20:37
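
The `pandas.Categorical` suggestion from the comments could look roughly like the sketch below. It reuses the question's data; the `order` and `counts` names are only illustrative, and `sort_index()` is one possible way to get the counts back in category order, not necessarily what the commenter had in mind.

```python
import pandas as pd

data = ['A'] * 2 + ['C'] * 3 + ['B'] * 1
my_df = pd.DataFrame({'mycolumn': data})

# Desired label order (illustrative name, not from the original post).
order = ['C', 'A', 'B']

# Declare the column as an ordered categorical so pandas knows the label order.
my_df['mycolumn'] = pd.Categorical(my_df['mycolumn'], categories=order, ordered=True)

# value_counts() sorts by frequency; sort_index() on the resulting
# categorical index puts the counts back in category order.
counts = my_df['mycolumn'].value_counts().sort_index()
print(counts)
```

Calling `counts.plot.bar()` on the result should then draw the bars in the category order.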

2 Answers

10

You can use value_counts to count the occurrences, loc to put the counts in the order you want, and a bar plot:

my_df['mycolumn'].value_counts().loc[['C', 'A', 'B']].plot.bar()

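To use integers on the y-axis, add the following, where ax is the Axes object returned by plot.bar() and MaxNLocator is imported from matplotlib.ticker:

ax.yaxis.set_major_locator(MaxNLocator(integer=True))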
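
Put together, the two snippets above might look like the following self-contained sketch. It assumes the question's data; the `counts`, `fig`, and `ax` names are illustrative, and the `Agg` backend line is only there so the script runs without a display (drop it for interactive use).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MaxNLocator

data = ['A'] * 2 + ['C'] * 3 + ['B'] * 1
my_df = pd.DataFrame({'mycolumn': data})

# Count occurrences, then select the labels in the desired order.
counts = my_df['mycolumn'].value_counts().loc[['C', 'A', 'B']]

fig, ax = plt.subplots()
counts.plot.bar(ax=ax)

# Restrict the y-axis ticks to integers, since the values are discrete counts.
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
```

Note that `.loc` with a list of labels raises a `KeyError` if one of the labels never occurs in the column, so this sketch assumes every label in the list is present.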

Mykola Zotko
1

You can create a sorter dict to sort your dataframe prior to plotting. For integers, you can use MaxNLocator:

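import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MaxNLocator
fig, ax = plt.subplots()
data = ['A']*2 + ['C']*3 + ['B']*1
my_df = pd.DataFrame({'mycolumn': data})
# Map each label to its position in the desired order
sorter = {k: v for v, k in enumerate(['C', 'A', 'B'])}
# sort=False keeps the sorted row order instead of re-sorting by frequency
# (order of appearance is preserved in recent pandas versions)
(my_df.assign(sorter=my_df['mycolumn'].map(sorter))
      .sort_values('sorter')['mycolumn'].value_counts(sort=False).plot.bar(ax=ax))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))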
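
As a variation on the sorter idea, the dict can be applied to the counts rather than to the rows. This is a sketch of an alternative, not the answer's own method: it assumes pandas 1.1 or later (for the `key` argument of `sort_index`) and uses a deliberately different order, B, C, A, so that the result cannot be confused with a sort by frequency.

```python
import pandas as pd

data = ['A'] * 2 + ['C'] * 3 + ['B'] * 1
my_df = pd.DataFrame({'mycolumn': data})

# Map each label to its position in the desired order (B, C, A is an
# illustrative choice that differs from the frequency order C, A, B).
sorter = {label: position for position, label in enumerate(['B', 'C', 'A'])}

# Count first, then order the index by each label's position in `sorter`.
counts = my_df['mycolumn'].value_counts().sort_index(
    key=lambda idx: idx.map(sorter)
)
print(counts)
```

The ordered `counts` can then be passed to `plot.bar(ax=ax)` exactly as in the answer's code.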

David Erickson