I am almost done with my first real Python data science project, but there is one last thing I can't figure out. I have the following code to plot the results of my PCA and K-means clustering:

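import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text

# Component 2 goes on the x-axis, Component 1 on the y-axis
y_axis = passers_pca_kmeans['Component 1']
x_axis = passers_pca_kmeans['Component 2']

plt.figure(figsize=(10,8))
# Pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.scatterplot(x=x_axis, y=y_axis, hue=passers_pca_kmeans['Segment'], palette=['g','r','c','m'])
plt.title('Clusters by PCA Components')
plt.grid(zorder=0,alpha=.4)

# One text label per point, placed at the point's coordinates
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name in zip(
    passers_pca_kmeans['Component 2'], passers_pca_kmeans['Component 1'], passers_pca_kmeans.name)]

# Nudge the labels so they overlap less
adjust_text(texts)

plt.show()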
  • I finally got the code working to annotate the points using adjustText, but my plot has too many points to label them all; it is a mess with text everywhere.
  • I would like to annotate the scatterplot based on the value in the column 'Segment'.
    • The values in this column are the names of my four clusters: 'first', 'second', 'third', 'fourth'.
  • How do I alter my adjustText code to annotate only the points where 'Segment' == 'first'?
    • Would this be an np.where situation?
Trenton McKinney
bismo
  • [This answer](https://stackoverflow.com/a/14434334/1609514) shows how to add labels near data points individually. In the example they loop over all the points but you don't have to. – Bill Jun 08 '20 at 04:57
  • Oh wait, you're using Seaborn. But it may still work I'm not sure. – Bill Jun 08 '20 at 04:59
  • Does this answer your question? [Adding labels in x y scatter plot with seaborn](https://stackoverflow.com/questions/46027653/adding-labels-in-x-y-scatter-plot-with-seaborn) – Trenton McKinney Jun 08 '20 at 05:04
  • That's where I'm at right now. However, labeling all data points is too much of a mess. I want to label certain data points based on a column value in my data frame. – bismo Jun 08 '20 at 05:07
  • The answers in the duplicate show using the entire dataframe, you just need to [Boolean select](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) the points you want and pass that instead of the entire dataframe. – Trenton McKinney Jun 08 '20 at 05:10

1 Answer

You could use a boolean mask to select the rows you pass to the plt.text call, something like:

# True for the rows that belong to the 'first' cluster
mask = (passers_pca_kmeans["Segment"] == "first")
x = passers_pca_kmeans["Component 2"][mask]
y = passers_pca_kmeans["Component 1"][mask]
names = passers_pca_kmeans.name[mask]

texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name in zip(x,y,names)]

You could also write a longer, somewhat unruly list comprehension by adding an if condition:

x = passers_pca_kmeans["Component 2"]
y = passers_pca_kmeans["Component 1"]
names = passers_pca_kmeans.name
segments = passers_pca_kmeans["Segment"]

# Only create a label when the row's segment is 'first'
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name,segment in zip(x,y,names,segments) if segment == "first"]

I bet there is an answer with np.where as well.
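
For what it's worth, here is a rough sketch of what an np.where version might look like. The small DataFrame is a made-up stand-in for passers_pca_kmeans (its values and names are invented for illustration), and the Agg backend is only there so the snippet runs without a display. Called with just a condition, np.where returns the positions where the condition holds, which can then be fed to .iloc:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for passers_pca_kmeans; the values are invented
passers_pca_kmeans = pd.DataFrame({
    "Component 1": [0.5, -1.2, 2.0, 0.1, -0.7],
    "Component 2": [1.0, 0.3, -0.5, 2.2, -1.1],
    "Segment": ["first", "second", "first", "third", "fourth"],
    "name": ["A", "B", "C", "D", "E"],
})

# With a single argument, np.where returns a tuple of index arrays;
# [0] picks the row positions where the condition is True
idx = np.where(passers_pca_kmeans["Segment"].to_numpy() == "first")[0]
subset = passers_pca_kmeans.iloc[idx]

fig, ax = plt.subplots(figsize=(10, 8))
# Plot every point, but only label the rows selected above
ax.scatter(passers_pca_kmeans["Component 2"], passers_pca_kmeans["Component 1"])
texts = [
    ax.text(x0, y0, name, ha="right", va="bottom")
    for x0, y0, name in zip(subset["Component 2"], subset["Component 1"], subset["name"])
]
```

The resulting texts list can be passed to adjust_text in the same way as before. In practice the boolean mask is shorter and does the same job, so np.where is mostly useful when you need the integer positions for something else.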

Tom
  • Awesome! This worked. Now, is there a possible way to extend the distance between the text and the points? – bismo Jun 08 '20 at 05:14
  • Does calling `adjust_text` not work here? I'm not familiar with that module. – Tom Jun 08 '20 at 05:15
  • It does. Now I just need to figure out how to draw a line that leads from each label to its point. I will check the documentation. – bismo Jun 08 '20 at 05:23
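
On the follow-up about spacing and leader lines: one possible approach with plain matplotlib, sketched below under some assumptions, is to use `ax.annotate` instead of `plt.text`. The `xytext` offset pushes the label away from the point and `arrowprops` draws the connecting line. The points list and the 15-point offset are invented for illustration. The adjustText documentation also describes an `arrowprops` argument to `adjust_text` (for example `adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray'))`), but that is worth checking against the installed version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical (x, y, name) triples standing in for the 'first' segment rows
points = [(1.0, 0.5, "A"), (-0.5, 2.0, "C")]

fig, ax = plt.subplots()
ax.scatter([p[0] for p in points], [p[1] for p in points])

annotations = [
    ax.annotate(
        name,
        xy=(x0, y0),                 # the data point being labelled
        xytext=(15, 15),             # label offset from the point, in points
        textcoords="offset points",
        arrowprops=dict(arrowstyle="-", color="gray", lw=0.8),  # plain leader line
    )
    for x0, y0, name in points
]
```

Increasing the `xytext` offset moves the labels farther from their points; the line always ends at the `xy` coordinates.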