I am looking for a way to draw a descriptive scatter plot of a pandas.DataFrame similar to this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   type    1000 non-null   object
 1   value   1000 non-null   int64
 2   count   1000 non-null   int64
dtypes: int64(2), object(1)
memory usage: 23.6+ KB

With pandas.DataFrame.plot or seaborn.scatterplot, the points for each type all land on a single vertical line and overlap each other. To mitigate this, I want to introduce at least some jitter in the x-direction, but I don't know how.

My plots so far:

import pandas as pd
import matplotlib.pyplot as plt
import random

df = pd.DataFrame({
    'type': [random.choice(['t1', 't2', 't3']) for _ in range(1000)],
    'value': [random.randint(0, 500) for _ in range(1000)],
    'count': [random.randint(0,250) for _ in range(1000)],
    })

df.plot(kind='scatter', x='type', y='value', c='count', cmap='Blues')
plt.show()

[Image: scatterplot using pandas]

import seaborn as sns

sns.scatterplot(x='type', y='value', data=df, hue='count')
plt.show()

[Image: scatterplot using seaborn]

upe

2 Answers

I managed to jitter the types by encoding them as numeric values and then jittering those values instead. However, this requires at least one more column in the DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import random

df = pd.DataFrame({
    'type': [random.choice(['t1', 't2', 't3']) for _ in range(1000)],
    'value': [random.randint(0, 500) for _ in range(1000)],
    'count': [random.randint(0,250) for _ in range(1000)],
    })

def jitter(x):
    # shift x by a random offset drawn uniformly from [-0.25, 0.25]
    return x + random.uniform(-.25, .25)

# numeric position on the x-axis for each type
type_ids = {'t1': 1, 't2': 2, 't3': 3}

df['type_id'] = df['type'].map(type_ids)
df['jitter_type'] = df['type_id'].apply(jitter)

df.plot(kind='scatter', x='jitter_type', y='value', c='count', cmap='Blues')
plt.xticks([1,2,3])
plt.gca().set_xticklabels(['t1', 't2', 't3'])
plt.show()

[Image: jittered scatterplot]
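
If the extra columns are a concern, the same idea can be done without touching the DataFrame. The following is only a sketch of a variant, not part of the answer above: `jittered_positions` is a hypothetical helper name, the category codes come from `pd.Categorical` (so they start at 0 rather than 1), and the noise is drawn in one vectorized NumPy call.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def jittered_positions(categories, width=0.5, seed=None):
    """Hypothetical helper: integer category codes plus uniform noise.

    Returns the jittered x positions and the category labels in code order.
    The noise lies in [-width/2, width/2].
    """
    cat = pd.Categorical(categories)
    rng = np.random.default_rng(seed)
    x = cat.codes + rng.uniform(-width / 2, width / 2, size=len(cat))
    return x, list(cat.categories)


# same kind of fake data as in the question, generated with NumPy
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'type': rng.choice(['t1', 't2', 't3'], size=1000),
    'value': rng.integers(0, 501, size=1000),
    'count': rng.integers(0, 251, size=1000),
})

x, labels = jittered_positions(df['type'], width=0.5, seed=1)

fig, ax = plt.subplots()
points = ax.scatter(x, df['value'], c=df['count'], cmap='Blues')
# put the original type names back on the x-axis (codes start at 0)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_xlabel('type')
ax.set_ylabel('value')
fig.colorbar(points, ax=ax, label='count')
```

Call `plt.show()` afterwards to display the figure. The DataFrame keeps its three original columns, since the jittered positions only live in the local array `x`.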

upe

The problem with your approach is that seaborn's scatterplot lacks specific functionality that makes sense in the context of categorical data, e.g., jitter. Hence, seaborn provides "scatterplots for categorical data": stripplot or swarmplot. But seaborn creates an ... interesting figure legend. We have to get rid of it and replace it with a colorbar:

#fake data generation
import pandas as pd
import numpy as np

np.random.seed(123)
ndf = 1000
df = pd.DataFrame({
    'Type': [np.random.choice(['t1', 't2', 't3']) for _ in range(ndf)],
    'Val': [np.random.randint(0, 700) for _ in range(ndf)],
    'Cou': [np.random.randint(0, 500) for _ in range(ndf)],
    })
    
#now the actual plotting  
import seaborn as sns
from matplotlib import colors, cm
import matplotlib.pyplot as plt
    
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

#preparation for the colorbars
pal = "coolwarm"
normpal = colors.Normalize(df.Cou.min(), df.Cou.max())

#stripplot display
sns.stripplot(x="Type", y="Val", data=df, hue="Cou", palette=pal, ax=ax1, jitter=0.2)
ax1.get_legend().remove()
ax1.set_title("stripplot")
fig.colorbar(cm.ScalarMappable(cmap=pal, norm=normpal), ax=ax1)

#swarmplot display
sns.swarmplot(x="Type", y="Val", data=df, hue="Cou", palette=pal, ax=ax2)
ax2.get_legend().remove()
ax2.set_title("swarmplot")
fig.colorbar(cm.ScalarMappable(cmap=pal, norm=normpal), ax=ax2)

plt.tight_layout()
plt.show()

Sample output: [Image: stripplot and swarmplot side by side]

Mr. T
  • > The problem with your approach is that by definition seaborn's scatterplot is for numerical data.
    I wouldn't say this. The categorical plotting functions in seaborn explicitly treat all data as categorical, but the inverse is not true; `scatterplot` (by virtue of the categorical support in matplotlib) handles categorical variables perfectly fine. But it does currently lack some features (like jitter) that make sense only or primarily in the context of categorical data. – mwaskom Nov 21 '20 at 17:42
  • Not much to add. True dat. Edited the description based on your input. – Mr. T Nov 21 '20 at 18:17
  • Exactly what I was looking for. With the help of [this answer](https://stackoverflow.com/a/15913419/7259176) I was able to label the `colorbar` ([`matplotlib.axes.Axes.set_ylabel`](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html)). – upe Nov 21 '20 at 18:42
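
For reference, labeling the colorbar as described in the last comment might look roughly like this. It is a minimal sketch, not the commenter's actual code: the `values` array is a stand-in for `df.Cou`, and the label text is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm, colors

# stand-in for df.Cou from the answer; only min and max matter for the norm
values = np.array([0, 125, 250, 499])
norm = colors.Normalize(values.min(), values.max())

fig, ax = plt.subplots()
cbar = fig.colorbar(cm.ScalarMappable(cmap="coolwarm", norm=norm), ax=ax)

# a colorbar owns its own Axes, so the usual Axes.set_ylabel works on it;
# cbar.set_label("Cou") is an alternative way to set the same text
cbar.ax.set_ylabel("Cou")
```

In the two-panel figure from the answer, the same call would be made once per colorbar, on the object returned by each `fig.colorbar(...)`.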