I have a dataset with multiple categories, and I want to plot all of them in a single figure to see how a value changes across them. The categories come from a list in the data set, and I would like every one of them to appear in the same figure.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

sample = [
['For business', 0.7616104043587437],
['For home and cottages', 0.6890139579274699],
['Consumer electronics', 0.039868871866136635],
['Personal things', 0.7487893699793786],
['Services', 0.747226678171249],
['Services', 0.23463661173977313],
['Animals', 0.6504301798258314],
['For home and cottages', 0.49567857024037665],
['For home and cottages', 0.9852681814098107],
['Transportation', 0.8134867587477912],
['Animals', 0.49988690699674654],
['Consumer electronics', 0.15086800344617235],
['For business', 0.9485494576819328],
['Hobbies and Leisure', 0.25766871111905243],
['For home and cottages', 0.31704508627659533],
['Animals', 0.6192114570078333],
['Personal things', 0.5755788287287359],
['Hobbies and Leisure', 0.10106922056341394],
['Animals', 0.16834618003738577],
['Consumer electronics', 0.7570803588496894]
]
train = pd.DataFrame(data=sample,  columns=['parent_category_name','deal_probability'])
parent_categories = train['parent_category_name'].unique()
parent_categories_size = len(parent_categories)
fig, ax = plt.subplots(figsize=(12,10))
colors = iter(cm.rainbow(np.linspace(0, 1, parent_categories_size)))

for parent_category_n in range(parent_categories_size):
    # select the rows of the current category by its index in parent_categories
    parent_1 = train[train['parent_category_name'] == parent_categories[parent_category_n]]
    ax.scatter(
        range(parent_1.shape[0]), 
        np.sort(parent_1.deal_probability.values),
        color = next(colors)
    )
plt.ylabel('likelihood that an ad actually sold something', fontsize=12)
plt.title('Distribution of likelihood that an ad actually sold something')

I have no idea why I can only see the last plot instead of all of them. Alternatively, I could work with multiple scatter plots in one figure, but I'm having a hard time plotting that.

Currently I'm working with 10 categories but I'm trying to make it dynamic.

pablora
  • I've tried to use something similar to what is asked here (https://stackoverflow.com/questions/48380953/multiple-scatter-plots-with-matplotlib-and-strings-on-the-x-axis) but I'm only getting the last plot. – pablora May 26 '18 at 08:42
  • I'm using matplotlib 2.1.2 – pablora Jun 02 '18 at 07:49
  • Sorry for that, editing the question. – pablora Jun 04 '18 at 07:47
  • Now I can reproduce an output, but I can't reproduce your problem: the diagram displays all points. Several questions, though. 1) Do you mean `ylabel` instead of `xlabel`? Like `plt.title`, it doesn't need to be inside the loop, because you only have to set it once. 2) Why do you first retrieve `parent_categories` from your dataframe and then overwrite it with a predefined list? 3) Your code does not use the categorical data; instead, it plots the probabilities in ascending order against the position number within the category. Is this the intention? – Mr. T Jun 04 '18 at 08:19
  • Thanks for following up on this! I'm trying to plot multiple figures (one for each category_name) so I can see whether deal_probability tends to be higher for some of them. I have 10 category_names and I want to plot one graph for each of them. 1) You're right, fixed in my code and edited the question here. Also removed from the loop. 2) Another mistake I introduced when trying to make it an MCVE. 3) Yes, that's the intention: to plot the probabilities in ascending order for each category in a different figure. – pablora Jun 04 '18 at 21:09

1 Answer

If you want to observe the development over time, a line plot with markers is probably better to visualize the changes in each category:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.cm as cm

sample = [  ['For business', 0.7616104043587437],
            ['For home and cottages', 0.6890139579274699],
            ['Consumer electronics', 0.039868871866136635],
            ['Personal things', 0.7487893699793786],
            ['Services', 0.747226678171249],
            ['Services', 0.23463661173977313],
            ['Animals', 0.6504301798258314],
            ['For home and cottages', 0.49567857024037665],
            ['For home and cottages', 0.9852681814098107],
            ['Transportation', 0.8134867587477912],
            ['Animals', 0.49988690699674654],
            ['Consumer electronics', 0.15086800344617235],
            ['For business', 0.9485494576819328],
            ['Hobbies and Leisure', 0.25766871111905243],
            ['For home and cottages', 0.31704508627659533],
            ['Animals', 0.6192114570078333],
            ['Personal things', 0.5755788287287359],
            ['Hobbies and Leisure', 0.10106922056341394],
            ['Animals', 0.16834618003738577],
            ['Consumer electronics', 0.7570803588496894] ]

train = pd.DataFrame(data=sample,  columns=['parent_category_name','deal_probability'])
parent_categories = train['parent_category_name'].unique()

fig, ax = plt.subplots(figsize=(10,8))
colors = iter(cm.rainbow(np.linspace(0, 1, len(parent_categories))))

for parent_category in parent_categories:
    ax.plot(range(len(train[train["parent_category_name"] == parent_category])), 
            sorted(train[train["parent_category_name"] == parent_category].deal_probability.values),
            color = next(colors),
            marker = "o",
            label = parent_category)

plt.ylabel('likelihood that an ad actually sold something', fontsize=12)
plt.title('Distribution of likelihood that an ad actually sold something')
plt.legend(loc = "best")
plt.show()

Output:

[image: plot produced by the code above]
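
The loop above filters the dataframe twice per category. As a variation that is not part of the original answer, the same line plot can probably be written with `groupby`. This is only a minimal sketch: the helper name `plot_sorted_by_category` is made up for illustration, the data is just the first eight rows of the question's sample, and the `Agg` backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm

# first eight rows of the sample from the question
sample = [
    ['For business', 0.7616104043587437],
    ['For home and cottages', 0.6890139579274699],
    ['Consumer electronics', 0.039868871866136635],
    ['Personal things', 0.7487893699793786],
    ['Services', 0.747226678171249],
    ['Services', 0.23463661173977313],
    ['Animals', 0.6504301798258314],
    ['For home and cottages', 0.49567857024037665],
]


def plot_sorted_by_category(train, ax):
    """Hypothetical helper: draw one marked line per category on ax.

    Returns the category names in the order they were plotted.
    """
    # sort=False keeps the categories in order of first appearance
    groups = train.groupby('parent_category_name', sort=False)
    colors = cm.rainbow(np.linspace(0, 1, groups.ngroups))
    plotted = []
    for color, (name, group) in zip(colors, groups):
        values = np.sort(group['deal_probability'].to_numpy())
        ax.plot(range(len(values)), values, color=color, marker='o', label=name)
        plotted.append(name)
    return plotted


train = pd.DataFrame(sample, columns=['parent_category_name', 'deal_probability'])
fig, ax = plt.subplots(figsize=(10, 8))
plotted = plot_sorted_by_category(train, ax)
ax.set_ylabel('likelihood that an ad actually sold something', fontsize=12)
ax.legend(loc='best')
```

Since every `ax.plot` call targets the same `ax`, the number of lines on the axes should equal the number of categories, however many there are.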

But since the x-axis is an arbitrary scale and you sort the data anyway, in my opinion the spread is easier to see in a categorical plot:

train = pd.DataFrame(data=sample,  columns=['parent_category_name','deal_probability'])
parent_categories = train['parent_category_name'].unique()

fig, ax = plt.subplots(figsize=(18,9))
colors = iter(cm.rainbow(np.linspace(0, 1, len(parent_categories))))

for parent_category in parent_categories:
    ax.scatter(
        train[train["parent_category_name"] == parent_category].parent_category_name.values, 
        train[train["parent_category_name"] == parent_category].deal_probability.values,
        color = next(colors),
        label = parent_category
    )

plt.ylabel('likelihood that an ad actually sold something', fontsize=12)
plt.title('Distribution of likelihood that an ad actually sold something')
plt.legend(loc = "best")
plt.show()

Output:

[image: plot produced by the code above]
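
The comments also mention wanting one graph per category. That is not covered by the code above, but a grid of subplots might be a starting point. The following is a rough sketch under a few assumptions: `plot_category_grid` and its `ncols` parameter are names invented here, the data is the first eight rows of the question's sample, and the `Agg` backend is only set so that the snippet runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# first eight rows of the sample from the question
sample = [
    ['For business', 0.7616104043587437],
    ['For home and cottages', 0.6890139579274699],
    ['Consumer electronics', 0.039868871866136635],
    ['Personal things', 0.7487893699793786],
    ['Services', 0.747226678171249],
    ['Services', 0.23463661173977313],
    ['Animals', 0.6504301798258314],
    ['For home and cottages', 0.49567857024037665],
]


def plot_category_grid(train, ncols=3):
    """Hypothetical helper: one scatter subplot per category, sorted values."""
    groups = train.groupby('parent_category_name', sort=False)
    nrows = math.ceil(groups.ngroups / ncols)
    # squeeze=False always returns a 2D array of axes, even for a single row
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             sharey=True, squeeze=False)
    flat = axes.ravel()
    for ax, (name, group) in zip(flat, groups):
        values = np.sort(group['deal_probability'].to_numpy())
        ax.scatter(range(len(values)), values)
        ax.set_title(name)
    # hide the unused cells when the grid has more slots than categories
    for ax in flat[groups.ngroups:]:
        ax.set_visible(False)
    return fig, axes


train = pd.DataFrame(sample, columns=['parent_category_name', 'deal_probability'])
fig, axes = plot_category_grid(train, ncols=4)
```

The number of rows is derived from the number of categories, so the grid should adapt when the category list changes. `sharey=True` keeps the probability axis comparable across panels.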

Mr. T