
When I build a scatterplot of this data, you can see that the one large value (462) swamps the plot, to the point where some of the other points can't be seen at all.

Does anyone know of a specific way to normalize this data so that the small dots can be seen, while maintaining a link between the size of the dot and the size of the value? I'm wondering whether either of these would make sense:

(1) Set a minimum value for the size a dot can be

(2) Normalize the data somehow, although I guess the large data point will always be 462 compared with other points that have a value of 1.

I'm just wondering how other people get around this, so that they don't miss points that are actually on the plot. Or is the most obvious answer simply not to scale the points by size, and instead add a label to each point showing its value?


Beso
Slowat_Kela

2 Answers

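import pandas as pd
import numpy as np
import plotly.express as px

# Sample data: 25 classes with random values between 1 and 39
df = pd.DataFrame(
    {"Class": np.linspace(-8, 4, 25), "Values": np.random.randint(1, 40, 25)}
).assign(Class=lambda d: "class_" + d["Class"].astype(str))
# Inject a single large outlier, as in the question
df.iloc[7, 1] = 462

# Cap the value used for marker size at 50; the y axis still shows the real value
px.scatter(df, x="Class", y="Values", size=df["Values"].clip(0, 50))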
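
A possible variant of the snippet above, in case the capped points should still be identifiable: it keeps the same clipping idea but adds a text label to any point whose size was capped. This is only a sketch; `cap_sizes`, `plot_capped`, the `cap` parameter, and the `size`/`callout` column names are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd


def cap_sizes(df, cap=50):
    """Add a capped size column and a callout label for points above the cap."""
    over = df["Values"] > cap
    return df.assign(
        size=df["Values"].clip(upper=cap),
        # Only capped points get a label; all others stay blank.
        callout=np.where(over, df["Values"].astype(str), ""),
    )


def plot_capped(df, cap=50):
    # plotly is imported here so cap_sizes can be used without it.
    import plotly.express as px

    capped = cap_sizes(df, cap)
    fig = px.scatter(capped, x="Class", y="Values", size="size", text="callout")
    fig.update_traces(textposition="top center")
    return fig


# Small made-up data set with one outlier, mirroring the question.
df = pd.DataFrame(
    {"Class": [f"class_{i}" for i in range(5)], "Values": [1, 3, 20, 39, 462]}
)
capped = cap_sizes(df)
```

Calling `plot_capped(df).show()` would then draw the figure with the outlier labelled by its true value.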


Rob Raymond

This isn't really a question about Python as such, but more about plotting style. There are several ways to solve the issue in your case:

  1. Split the data into equally sized categories and assign color labels. Your legend would then look something like this: 0 - 1: color 1, 2 - 20: color 2, ... To implement this, split your data into the sets you want and draw a separate scatter plot for each set, each with its own color. See here or here for examples

  2. The second option, which is frequently used, is to use the log of the value for the bubble size. You would just have to point that out clearly in your legend.

  3. The third option is to limit the marker size to an arbitrary value. I personally am not a big fan of this method, since it changes the information shown to a degree that the other alternatives don't, but if you add a data callout, it would still be legitimate.
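
For option 2, a minimal sketch could look like the following. It assumes the same `Class`/`Values` column names as the question and uses `log(1 + value)` rather than a plain log so that a value of 1 still gets a visible marker; the helper names `log_sizes` and `plot_log_bubbles` and the `log_size` column are made up for illustration.

```python
import numpy as np
import pandas as pd


def log_sizes(values):
    """Return log(1 + value) for each value, compressing large outliers."""
    return np.log1p(np.asarray(values, dtype=float))


def plot_log_bubbles(df):
    # plotly is imported here so log_sizes can be used without it.
    import plotly.express as px

    sized = df.assign(log_size=log_sizes(df["Values"]))
    return px.scatter(
        sized,
        x="Class",
        y="Values",
        size="log_size",
        hover_data=["Values"],
        # Rename the size column in hover text so the log scaling is explicit.
        labels={"log_size": "log(1 + Values)"},
    )


# Small made-up data set with one outlier, mirroring the question.
df = pd.DataFrame(
    {"Class": [f"class_{i}" for i in range(5)], "Values": [1, 3, 20, 39, 462]}
)
sizes = log_sizes(df["Values"])
```

With this scaling the largest size value is less than ten times the smallest, instead of 462 times, so the small points remain visible while larger values still get larger markers.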

These options should be fairly easy to implement in code. If you are having difficulties, feel free to post runnable sample code and we could implement an example as well.
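
For option 1, one way it might be done is with `pandas.cut` and the `color` argument of `px.scatter`, which draws one trace per category. The bin edges and labels below are arbitrary choices for illustration (they assume integer values), and `add_value_bucket`, `plot_by_bucket`, and the `bucket` column are hypothetical names.

```python
import numpy as np
import pandas as pd


def add_value_bucket(df):
    """Add a categorical 'bucket' column based on hand-picked bin edges."""
    edges = [0, 1, 20, 50, np.inf]  # assumed edges; adjust to the data
    labels = ["0 - 1", "2 - 20", "21 - 50", "> 50"]
    bucket = pd.cut(df["Values"], bins=edges, labels=labels, include_lowest=True)
    return df.assign(bucket=bucket)


def plot_by_bucket(df):
    # plotly is imported here so add_value_bucket can be used without it.
    import plotly.express as px

    bucketed = add_value_bucket(df)
    # Each bucket becomes its own colored trace with a legend entry.
    return px.scatter(bucketed, x="Class", y="Values", color="bucket")


# Small made-up data set with one outlier, mirroring the question.
df = pd.DataFrame(
    {"Class": [f"class_{i}" for i in range(5)], "Values": [1, 3, 20, 39, 462]}
)
bucketed = add_value_bucket(df)
```

Because all markers have the same size here, the outlier no longer hides its neighbours; the color legend carries the magnitude information instead.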

C Hecht