How can I get sorted cumulative plots in numpy/matplotlib or Pandas?

Let me explain this with an example. Say we have the following data:

number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]

We want to plot a chart where a given (x, y) point reads as: the top X% of stores by sales sold Y% of all items. That is, it displays the data as follows:

[figure: sketch of the desired cumulative plot]

where the best-selling stores are to the left (i.e. the slope of the plot decreases monotonically). How can I do this in numpy or Pandas (i.e. assuming the above is a Series)?

Amelio Vazquez-Reina

3 Answers

2

Assuming that you want the best performing stores to come first:

import numpy as np
import matplotlib.pyplot as plt

number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]

ar = sorted(number_of_items_sold_per_store,reverse=True)
y = np.cumsum(ar).astype("float32")

#normalise to a percentage
y/=y.max()
y*=100.

#prepend a 0 to y as zero stores have zero items
y = np.hstack((0,y))

#get cumulative percentage of stores
x = np.linspace(0,100,y.size)

#plot
plt.plot(x,y)
plt.show()
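
To read a single point off this curve, for instance the share of items sold by the top 50% of stores, one option is linear interpolation with np.interp. This is only a minimal sketch and not part of the original answer: the helper name share_sold_by_top is hypothetical, and it rebuilds the curve without plotting it.

```python
import numpy as np

number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]


def share_sold_by_top(sales, top_percent):
    """Percentage of all items sold by the top `top_percent` % of stores.

    Hypothetical helper: builds the sorted cumulative curve and
    interpolates linearly between neighbouring stores.
    """
    ar = np.sort(np.asarray(sales, dtype=float))[::-1]  # best sellers first
    y = np.concatenate(([0.0], np.cumsum(ar)))          # zero stores sold zero items
    y = 100.0 * y / y[-1]                               # normalise to a percentage
    x = np.linspace(0, 100, y.size)                     # cumulative % of stores
    return float(np.interp(top_percent, x, y))


print(share_sold_by_top(number_of_items_sold_per_store, 50))
```

Interpolating between stores is a design choice; with only 13 stores, "the top 50%" does not correspond to a whole number of stores, so the value is an estimate along the plotted line.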


ebarr
  • Thanks, this **does** work. Do you happen to know what is the name of this type of statistical plot? – Amelio Vazquez-Reina Nov 18 '14 at 01:22
  • I'm not really sure this has a name as the statistical meaning is not clear. The shape of the curve is dependent on the ordering of the input. It has similarities to ROC plots from machine learning, but really the statistical value of this is limited due to the x axis ambiguity. – ebarr Nov 18 '14 at 03:56
  • Thanks @ebarr. It is called a Lorentz plot: http://stats.stackexchange.com/questions/124465/what-is-the-name-of-this-cumulative-plot. – Amelio Vazquez-Reina Nov 18 '14 at 04:21
  • I think you mean a Lorenz plot: http://en.wikipedia.org/wiki/Lorenz_curve. This would make sense then as it requires a specific preordering of the data. – ebarr Nov 18 '14 at 05:19
1

I think the steps involved here are:

  • Sort the list of sale counts in descending order
  • Get the cumulative sum of the sorted list
  • Divide by the overall total and multiply by 100 to convert to percentage
  • Plot!

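import numpy as np
import matplotlib.pyplot as plt

number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
n_sold = number_of_items_sold_per_store
sorted_sales = sorted(n_sold, reverse=True)  # best-selling stores first
total_sales = np.sum(n_sold)
cum_sales = np.cumsum(sorted_sales).astype(np.float64) / total_sales
cum_sales *= 100  # Convert to percentage
# Prepend 0 so the curve starts at (0, 0): zero stores sold zero items
cum_sales = np.hstack((0, cum_sales))
# Borrowing the linspace trick from ebarr
x_vals = np.linspace(0, 100, len(cum_sales))
plt.plot(x_vals, cum_sales)
plt.show()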
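
Since the question assumes the data is a pandas Series, the same steps can also be sketched with pandas methods. This is an illustration under that assumption rather than part of the original answer; the function name sorted_cumulative_percent and the index/series labels are made up for the example.

```python
import numpy as np
import pandas as pd

sales = pd.Series([10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6])


def sorted_cumulative_percent(s):
    """Cumulative % of items sold versus % of stores, best sellers first."""
    ordered = s.sort_values(ascending=False)
    cum_pct = 100.0 * ordered.cumsum() / ordered.sum()
    # Start the curve at (0, 0): zero stores have sold zero items.
    y = np.concatenate(([0.0], cum_pct.to_numpy()))
    x = np.linspace(0, 100, len(y))
    return pd.Series(
        y,
        index=pd.Index(x, name="percent_of_stores"),
        name="percent_of_items_sold",
    )


curve = sorted_cumulative_percent(sales)
print(curve.round(1))
```

With matplotlib installed, calling curve.plot() should draw the curve with the percentage of stores on the x axis.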


Marius
0

This works for me. If number_of_items_sold_per_store is a pandas Series, you can convert it to a numpy array using number_of_items_sold_per_store.values.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]

# Create histogram
values, base = np.histogram(number_of_items_sold_per_store, bins=500)

# Cumulative data
cum = np.cumsum(values)

# plot the cumulative function
plt.plot(base[:-1], cum, c='red')

plt.show()

user308827