
I'm trying to update the function below so that it reports cluster information via a legend:

color_names = ["red", "blue", "yellow", "black", "pink", "purple", "orange"]

def plot_3d_transformed_data(df, title, colors="red"):
 
  ax = plt.figure(figsize=(12,10)).gca(projection='3d')
  #fig = plt.figure(figsize=(8, 8))
  #ax = fig.add_subplot(111, projection='3d')
  

  if type(colors) is np.ndarray:
    for cname, class_label in zip(color_names, np.unique(colors)):
      X_color = df[colors == class_label]
      ax.scatter(X_color[:, 0], X_color[:, 1], X_color[:, 2], marker="x", c=cname, label=f"Cluster {class_label}" if type(colors) is np.ndarray else None)
  else:
      ax.scatter(df.Type, df.Length, df.Freq, alpha=0.6, c=colors, marker="x", label=str(clusterSizes)  )

  ax.set_xlabel("PC1: Type")
  ax.set_ylabel("PC2: Length")
  ax.set_zlabel("PC3: Frequency")
  ax.set_title(title)
  
  if type(colors) is np.ndarray:
    #ax.legend()
    plt.gca().legend()
    
  
  plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
  plt.show()

I then call the function to visualize the cluster patterns:

plot_3d_transformed_data(pdf_km_pred,
                         f'Clustering rare URL parameters for data of date: {DATE_FROM}  \nMethod: KMeans over PCA \nn_clusters={n_clusters} , Distance_Measure={DistanceMeasure}',
                         colors=pdf_km_pred.prediction_km)

print(clusterSizes)

Sadly, the legend does not show, and I have to print the cluster members manually under the 3D plot. The plot is produced without a legend, along with the following message: No handles with labels found to put in legend.

I checked this post, but I couldn't figure out which mistake in the function prevents the cluster label list from being passed properly. I want to update the function so that it shows the cluster labels via clusterSizes.index and their sizes via clusterSizes.size.

Expected output: as suggested here, it is better to use legend_elements(), which determines a useful number of legend entries to show and returns a tuple of handles and labels automatically.

Update: As mentioned in the expected output, the plot should contain one legend for the cluster labels and another legend for the cluster sizes (the number of instances in each cluster). This info could also be reported via a single legend. Please see the 2D example below: [image]

Mario
  • I don't fully understand all of your issues, but I have simplified your code and borrowed some of @meTchaikovsky's data to create a graph. Do you mean that you want to create this legend for each cluster? The purpose of this legend is to visualize the size, so I am not sure if it can be created for each cluster. Also, it is possible to visualize the size without dividing it into clusters. – r-beginners Sep 02 '21 at 06:50
  • i'm also a bit confused. e.g., the question mentions `clusterSizes.index` and `clusterSizes.size` which sounds like a dataframe, but the code uses `str(clusterSizes)` which wouldn't make sense for a dataframe. – tdy Sep 02 '21 at 06:59
  • it would help to see `clusterSizes` if it's indeed a dataframe and ideally a sketch/mock-up of the expected output – tdy Sep 02 '21 at 07:01
  • @r-beginners thanks for providing the notebook for quick debugging. I added an update at the end of the post to make it clear. I checked the notebook: the 2nd legend, which indicates the predicted cluster labels, is still missing. – Mario Sep 02 '21 at 21:34
  • @tdy thanks for your input. The results of the clustering algorithm could be reported/passed via a Spark dataframe for big data. The point is to provide automated legends that show the clustering results in terms of cluster labels and cluster sizes, in order to understand the pattern of outliers using embedded methods (e.g. PCA) on the top features for better visualization. Please see this [notebook](https://colab.research.google.com/drive/1DMBMlICT-iq5_i5Oz-NC5WS4eBPRdgrB#scrollTo=l7QDjHsfhxf0). I want to update the function so that it plots and visualizes all cluster info automatically. – Mario Sep 02 '21 at 22:13
  • Updated Colab with two legend examples showing class and size. The color setting must be numeric instead of string to be supported. If you want to set it to an arbitrary color, you will need to create your own standardized color map. I will answer if this code is OK. If you don't need an answer, delete the comment containing the Colab link. – r-beginners Sep 03 '21 at 03:23
  • @r-beginners thanks for updating the Colab link. The problem is that `clusterSizes.shape` is `(8, 1)`, since it holds the short results of the clustering algorithm that I reported in the form of a [dataframe](https://i.imgur.com/CV0LpXt.jpg), while `pdf_km_pred.shape` is `(921325, 10)`. I also wanted to create the standard color map using `plt.set_cmap('jet')`. When you generated the data, you put x: `Type`, y: `Length`, z: `Frequency`, as well as `size` & `colors`, all in one dataframe. Because of this mismatch I get an error, so I can't use `c=np.arange(10), s=df['size']` and produce the expected plot with the desired legends. – Mario Sep 05 '21 at 14:46
  • Please see the notebook I shared in the previous comment for @tdy; you can update it and form the final answer here, since today is the last 24 hours before the bounty expires. Your solution produces the expected output once you account for the mismatch problem I explained, which arises because the scatter plot arguments are fed from different dataframes. – Mario Sep 05 '21 at 14:53
  • I read your comment, but I don't quite understand it. What am I missing from the other answers or from my trial? I will remove the Colab link, if that is okay. – r-beginners Sep 05 '21 at 14:58

2 Answers


In the function to visualize the clusters, you need ax.legend instead of plt.legend

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
import numpy as np
import pandas as pd

color_names = ["red", "blue", "yellow", "black", "pink", "purple", "orange"]

def plot_3d_transformed_data(df, title, colors="red"):
 
  # gca(projection=...) is rejected by recent Matplotlib releases; add_subplot works on old and new versions.
  fig = plt.figure(figsize=(12, 10))
  ax = fig.add_subplot(111, projection='3d')
  

  if type(colors) is np.ndarray:
    for cname, class_label in zip(color_names, np.unique(colors)):
      X_color = df[colors == class_label]
      ax.scatter(X_color[:, 0], X_color[:, 1], X_color[:, 2], marker="x", c=cname, label=f"Cluster {class_label}" if type(colors) is np.ndarray else None)
  else:
      ax.scatter(df.Type, df.Length, df.Freq, alpha=0.6, c=colors, marker="x", label=str(clusterSizes)  )

  ax.set_xlabel("PC1: Type")
  ax.set_ylabel("PC2: Length")
  ax.set_zlabel("PC3: Frequency")
  ax.set_title(title)
  
  if type(colors) is np.ndarray:
    #ax.legend()
    plt.gca().legend()
    
  
  ax.legend(bbox_to_anchor=(.9,1), loc="upper left")
  plt.show()

clusterSizes = 10

test_df = pd.DataFrame({'Type':np.random.randint(0,5,10),
                        'Length':np.random.randint(0,20,10),
                        'Freq':np.random.randint(0,10,10),
                        'Colors':np.random.choice(color_names,10)})

plot_3d_transformed_data(test_df,
                         'Clustering rare URL parameters for data of date:haha\nMethod: KMeans over PCA \nn_clusters={n_clusters} , Distance_Measure={DistanceMeasure}',
                         colors=test_df.Colors)

Running this example code produces the legend handle as expected.
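
If the legend should also report how many points each cluster holds, one option is to issue one scatter call per cluster and put the count into its label. The following is only a minimal sketch under assumptions: the `points` dataframe is made-up data standing in for `pdf_km_pred`, and the sizes are recomputed with `value_counts()` rather than taken from the `clusterSizes` object in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for pdf_km_pred: three components plus a cluster id.
points = pd.DataFrame({
    "Type": rng.normal(size=60),
    "Length": rng.normal(size=60),
    "Freq": rng.normal(size=60),
    "prediction_km": rng.integers(0, 3, size=60),
})

# Number of points per cluster, indexed by cluster id.
cluster_sizes = points["prediction_km"].value_counts().sort_index()

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

# One scatter call per cluster: each call takes the next color of the
# default cycle and contributes exactly one legend entry.
for cluster_id, group in points.groupby("prediction_km"):
    ax.scatter(group["Type"], group["Length"], group["Freq"], marker="x",
               label=f"Cluster {cluster_id} (n={cluster_sizes.loc[cluster_id]})")

legend = ax.legend(bbox_to_anchor=(1.04, 1), loc="upper left", title="Clusters")
legend_texts = [text.get_text() for text in legend.get_texts()]
```

Because every scatter call carries its own `label`, `ax.legend()` finds the handles by itself and the "No handles with labels found" message does not appear.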

meTchaikovsky
  • Thanks for your input, but the OP asked for the function to be equipped so that the plot indicates both the *clusters' sizes* and the *clusters' labels*, as shown in the update of the post (please see the example for 2D). In your solution the 2nd legend is missing. – Mario Sep 02 '21 at 22:05

You need to save the reference to the first legend and add it to your ax as a separate artist before creating the second legend. That way, the second call to ax.legend(...) does not erase the first legend.

For the second legend, I simply created a circle for each unique color and added it in. I forgot how to draw real circles, so instead I use a Line2D with lw=0, marker="o" which results in a circle.

Play around with the legend's bbox_to_anchor and loc keywords to get a result that satisfies you.

I got rid of everything relying on plt.<something> because it's the best way to forget which method is attached to which object. Now everything is in ax.<something> or fig.<something>. It's also the right approach for when you have several axes, or when you want to embed your canvas in a PyQt app. plt will not do what you expect there.

The initial code is the one provided by @r-beginners and I simply built upon it.

# Imports.
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np

# Figure.
figure = plt.figure(figsize=(12, 10))
ax = figure.add_subplot(projection="3d")
ax.set_xlabel("PC1: Type")
ax.set_ylabel("PC2: Length")
ax.set_zlabel("PC3: Frequency")
ax.set_title("scatter 3D legend") 

# Data and 3D scatter.
colors = ["red", "blue", "yellow", "black", "pink", "purple", "orange", "black", "red" ,"blue"]

df = pd.DataFrame({"type": np.random.randint(0, 5, 10),
                   "length": np.random.randint(0, 20, 10),
                   "freq": np.random.randint(0, 10, 10),
                   "size": np.random.randint(20, 200, 10),
                   "colors": np.random.choice(colors, 10)})

sc = ax.scatter(df.type, df.length, df.freq, alpha=0.6, c=colors, s=df["size"], marker="o")

# Legend 1.
handles, labels = sc.legend_elements(prop="sizes", alpha=0.6)
legend1 = ax.legend(handles, labels, bbox_to_anchor=(1, 1), loc="upper right", title="Sizes")
ax.add_artist(legend1) # <- this is important.

# Legend 2.
unique_colors = set(colors)
handles = []
labels = []
for n, color in enumerate(unique_colors, start=1):
    artist = mpl.lines.Line2D([], [], color=color, lw=0, marker="o")
    handles.append(artist)
    labels.append(str(n))
legend2 = ax.legend(handles, labels, bbox_to_anchor=(0.05, 0.05), loc="lower left", title="Classes")

figure.show()


Not related to the question: because of how markersize works for circles, one could use s = df["size"]**2 instead of s = df["size"].
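
The reason is that `scatter` interprets `s` as a marker area in points squared, while `Line2D` interprets `markersize` as a length in points. As a rough sketch of what that means for a hand-built size legend (the `sizes` values below are made up for illustration), the handles get `markersize=sqrt(s)` so that they match the plotted markers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D

# Hypothetical values passed as s= to scatter (areas in points squared).
sizes = [20, 80, 180]

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection="3d")
ax.scatter([0, 1, 2], [0, 1, 2], [0, 1, 2], s=sizes, color="gray", marker="o")

# Line2D's markersize is a length in points, so take the square root of the area.
handles = [Line2D([], [], lw=0, marker="o", color="gray", markersize=np.sqrt(s))
           for s in sizes]
labels = [str(s) for s in sizes]
legend = ax.legend(handles, labels, title="Sizes", loc="upper right")
```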

Guimoute
  • Thanks for posting your solution; it works if I pass the values for the scatter arguments via a single dataframe. However, may I draw your attention to the [Colab notebook](https://colab.research.google.com/drive/1DMBMlICT-iq5_i5Oz-NC5WS4eBPRdgrB?usp=sharing) for quick debugging? Since I try to build the scatter plot from 2 different dataframes, let's say `ax.scatter(x=df1[x], y=df1[y], z=df1[z])` and `ax.scatter(...., s=df2[clusterSize], c=df2[clusterSize])`, I get some errors. – Mario Sep 06 '21 at 15:10
  • @Mario `handles` and `labels` are lists, so you can sum them with other lists to add more elements. For example, if you have `sc1 = ax.scatter(df1...)`, and `sc2 = ax.scatter(df2...)`, build your handles and labels like so: `h1, l1 = sc1.legend_elements(...)` `h2, l2 = sc2.legend_elements(...)` `handles = h1 + h2` `labels = l1 + l2`. If you have many dataframes to use, we could easily turn that into a loop if you need. – Guimoute Sep 07 '21 at 09:36
  • I couldn't manage to apply your comment to the [colab notebook](https://colab.research.google.com/drive/1DMBMlICT-iq5_i5Oz-NC5WS4eBPRdgrB?usp=sharing). May I ask you to apply it to the provided notebook for quick debugging? – Mario Sep 07 '21 at 10:28
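
The suggestion in the comments above, concatenating the handles and labels of two scatter calls, might look roughly like the sketch below. It rests on assumptions that are not in the thread: `df1` is made-up per-point data, `df2` is a per-cluster summary (centroids plus a `clusterSize` column) built here for illustration, and `scale` is an arbitrary factor that turns counts into marker areas. Each scatter call only receives columns from its own dataframe, which sidesteps the length mismatch between the two.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical per-point data (df1): three clusters with 10, 20 and 30 points.
cluster_ids = np.repeat([0, 1, 2], [10, 20, 30])
df1 = pd.DataFrame({
    "Type": rng.normal(loc=cluster_ids, size=cluster_ids.size),
    "Length": rng.normal(loc=cluster_ids, size=cluster_ids.size),
    "Freq": rng.normal(loc=cluster_ids, size=cluster_ids.size),
    "prediction_km": cluster_ids,
})

# Hypothetical per-cluster summary (df2): one row per cluster.
df2 = df1.groupby("prediction_km")[["Type", "Length", "Freq"]].mean()
df2["clusterSize"] = df1["prediction_km"].value_counts().sort_index()

scale = 5.0  # assumed factor converting a count into a marker area

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

# First scatter: the points, colored by numeric cluster id through a colormap.
sc1 = ax.scatter(df1["Type"], df1["Length"], df1["Freq"],
                 c=df1["prediction_km"], cmap="jet", marker="o", alpha=0.6)
# Second scatter: one marker per cluster centroid, sized by cluster size.
sc2 = ax.scatter(df2["Type"], df2["Length"], df2["Freq"],
                 s=df2["clusterSize"] * scale, c="gray", marker="o", alpha=0.6)

# Handles and labels are plain lists, so they can be concatenated.
h1, l1 = sc1.legend_elements(prop="colors")
h2, l2 = sc2.legend_elements(prop="sizes", alpha=0.6, func=lambda s: s / scale)
legend = ax.legend(h1 + h2, l1 + l2, bbox_to_anchor=(1.04, 1),
                   loc="upper left", title="Cluster id / size")
```

The `func` argument maps the marker areas back to the original counts, so the size entries read 10, 20 and 30 rather than the scaled values. For two separate legends instead of one, the first can be kept alive with `ax.add_artist(...)` as shown in the answer.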