Cluster groups by direction and magnitude - Python

Question

I'm hoping to cluster vectors based on the direction and magnitude using python. I've found limited examples using R but none for python. Not to confuse with standard k-means for scatter points, I'm actually trying to cluster the whole vector.

The following takes two sets of xy points to generate a vector. I'm then hoping to cluster these vectors based on the length and direction.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'

fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)

A = df['A']
B = df['B']

C = df['C']
D = df['D']

ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5) 

X_1 = np.array(df[['A','B','C','D']])

model = KMeans(n_clusters = 20)
model.fit(X_1)

cluster_labels = model.predict(X_1)
df['n_cluster'] = cluster_labels
centroids_1 = pd.DataFrame(data = model.cluster_centers_, columns = ['start_x', 'start_y', 'end_x', 'end_y'])
cc = model.cluster_centers_

a = cc[:, 0]
b = cc[:, 1]
c = cc[:, 2]
d = cc[:, 3]

lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.8)

The following figure displays an example

There is a similar question on Data Science Stack Exchange: https://datascience.stackexchange.com/questions/34198/clustering-2-dimensional-euclidean-vectors-appropriate-dissimilarity-measure — fdermishin, Nov 25 '20 at 11:25
Thanks, I didn't see this. I think the question still holds as there are only suggestions in that post. — jonboy, Nov 26 '20 at 04:48

tgrandje · Accepted Answer · 2020-11-26T12:21:56.410

What about this :

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import hdbscan

df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'

A = df['A'] #X start
B = df['B'] #Y start
C = df['C'] #X arrive
D = df['D'] #Y arrive

clusterer = hdbscan.HDBSCAN()

df['LENGTH'] = np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B))
df['DIRECTION'] = np.degrees(np.arctan2(df.D-df.B, df.C-df.A))


coords = df[['LENGTH', 'DIRECTION']].values
clusterer.fit_predict(coords)

cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
        [(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
        columns=["points", "weight"]
        )

colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=['LENGTH', 'DIRECTION'])
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=['LENGTH', 'DIRECTION'], how='left')
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')

fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)

ax.quiver(df.A, df.B, (df.C-df.A), (df.D-df.B), angles='xy', scale_units='xy', scale=1, alpha=0.5, color=df.COLOR)

This will use clustering based on length and direction (direction being transformed to degrees, radians' small range doesn't match very well with the model on my first try).

I don't think this will be a very "cartesian" solution as the two values beeing analysed in the model are not in the same metrics... But the visual results are not so bad...

I did try another match based on the 4 coordinates, which is more rigorous. But it is (quite expectably) clustering the vectors by subareas of the space (when there are any) :

coords = df[['A', 'B', 'C', 'D']].values
clusterer.fit_predict(coords)

cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
        [(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
        columns=["points", "weight"]
        )

colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=['A', 'B', 'C', 'D'])
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=['A', 'B', 'C', 'D'], how='left')
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')

EDIT

I gave it another try, based on the (very good) suggestion that angles are not a good variable given the fact that there are discontinuities around 0/2pi ; so I choose to use both sinuses and cosinuses instead. I also scaled the length (to have matching scales for the 3 variables) :

So the result would be :

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import robust_scale
import hdbscan

df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'


A = df['A'] #X start
B = df['B'] #Y start
C = df['C'] #X arrive
D = df['D'] #Y arrive
clusterer = hdbscan.HDBSCAN()


df['LENGTH'] = robust_scale(np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B)))
df['DIRECTION'] = np.arctan2(df.D-df.B, df.C-df.A)
df['COS'] = np.cos(df['DIRECTION'])
df['SIN'] = np.sin(df['DIRECTION'])


columns = ['LENGTH', 'COS', 'SIN']

clusterer = hdbscan.HDBSCAN()
values = df[columns].values
clusterer.fit_predict(values)

cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
        [(values[cluster_labels==n], len(values[cluster_labels==n])) for n in range(num_clusters)],
        columns=["points", "weight"]
        )


def get_cmap(n, name='hsv'):
    '''
    Returns a function that maps each index in 0, 1, ..., n-1 to a distinct 
    RGB color; the keyword argument name must be a standard mpl colormap name.
    
    Credits to @Ali
    https://stackoverflow.com/questions/14720331/how-to-generate-random-colors-in-matplotlib#answer-25628397
    '''
    return plt.cm.get_cmap(name, n)

cmap = get_cmap(num_clusters+1)
colors = {x:cmap(x) for x in range(num_clusters)}
df['CLUSTER'] = np.nan


for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=columns)
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=columns, how='left')
    df.reset_index(drop=True, inplace=True)
    
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
    
df['CLUSTER'] = df['CLUSTER'].fillna(num_clusters-1)
df['COLOR'] = df['CLUSTER'].map(colors)
print("Number of clusters : ", num_clusters-1)

nrows = num_clusters//2 if num_clusters%2==0 else num_clusters//2 + 1
fig,axes = plt.subplots(nrows=nrows, ncols=2)
axes = [y for row in axes for y in row]
for k,ax in enumerate(axes):

    ax.set_xlim(-5, 25)
    ax.set_ylim(-5, 25)
    ax.set_aspect('equal', adjustable='box')
    if k+1 <num_clusters:
        ax.set_title(f"CLUSTER #{k+1}", fontsize=10)
    this_df = df[df.CLUSTER==k]
    ax.quiver(
        this_df.A, #X
        this_df.B, #Y
        (this_df.C-this_df.A), #X component of vector
        (this_df.D-this_df.B), #Y component of vector
        angles = 'xy', 
        scale_units = 'xy', 
        scale = 1, 
        color=this_df.COLOR
        )

The results are way better (though it depends much of the input dataset) ; the last subplots refers to the vectors not being found to be inside a cluster:

Edit #2

If by "direction" you mean angle in the [0..pi[ interval (ie undirected vectors), you will want to include the following code before computing the cosinuses/sinuses :

ix = df[df.DIRECTION<0].index
df.loc[ix, "DIRECTION"] += np.pi

This clustering has discontinuity when direction is pi or -pi. It would be better to use normalized 2D vector and length (3 parameters) instead of direction and length (2 parameters) to avoid the discontinuity — fdermishin, Nov 25 '20 at 11:22
@user13044086 : good point. But I'd say you will get to the same result as with clustering on the 4 values (Xstart, Ystart, Xend, Yend). Am I mistaken ? — tgrandje, Nov 25 '20 at 12:08
I don't think that results will be the same, because normalized vector and length are used, then direction and length become independent. But if you use Cartesian coordinates, then you can't distinguish very short vectors, and very long vectors become different even for slightest changes of direction — fdermishin, Nov 25 '20 at 12:22
@tgrandje, I prefer the second version. Although, both limit the amount of clusters to 5 plus 'extras'. Could we implement some form of elbow method and random colormap so it's not constrained? — jonboy, Nov 25 '20 at 23:51
I used a function to map colors and used @user13044086 's suggestion to change variables (instead of using vectors, I tried the clusterisation on both sinuses, cosinus and length). The result is way better, though not perfect. — tgrandje, Nov 26 '20 at 08:51

meTchaikovsky · Answer 2 · 2020-11-26T07:35:27.833

Maybe you can also cluster the angles (besides the vector norms) by the projections of a normalized vector onto the two unit vectors (1,0) and (0,1) with this function. Handling the projections directly (which are essentially the angles), we won't went into trouble caused by the periodicity of cosine function

def get_norm_and_angle(e1):

    e1_norm = np.linalg.norm(e1,axis=1)
    e1 = e1 / e1_norm[:,None]
    e2 = np.array([1,0])
    e3 = np.array([0,1])
    
    return np.stack((e1_norm,e1@e2,e1@e3),axis=1)

Based on this function, here is one possible solution where there is no constraint on how many clusters we want to find. In the script below, five features are used for clustering

Vector norm
Vector projections on x and y axis
Vector starting points

with these five features

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np
from sklearn.cluster import KMeans


def get_norm_and_angle(e1):

    e1_norm = np.linalg.norm(e1,axis=1)
    e1 = e1 / e1_norm[:,None]
    e2 = np.array([1,0])
    e3 = np.array([0,1])
    
    return np.stack((e1_norm,e1@e2,e1@e3),axis=1)


data = np.cumsum(np.random.randint(0,10,size=(50, 4)),axis=0)
df = pd.DataFrame(data, columns=list('ABCD'))

A = df['A'];B = df['B']
C = df['C'];D = df['D']

starting_points = np.stack((A,B),axis=1)
vectors = np.stack((C,D),axis=1) - np.stack((A,B),axis=1)
different_view = get_norm_and_angle(vectors)
different_view = np.hstack((different_view,starting_points))

num_clusters = 8
model = KMeans(n_clusters=num_clusters)
model.fit(different_view)

cluster_labels = model.predict(different_view)
df['n_cluster'] = cluster_labels
cluster_centers = model.cluster_centers_
cluster_offsets = cluster_centers[:,0][:,None] * cluster_centers[:,1:3]
cluster_starts = np.vstack([np.mean(starting_points[cluster_labels==ind],axis=0) for ind in range(num_clusters)])
main_streams = np.hstack((cluster_starts,cluster_starts+cluster_offsets))
a,b,c,d = main_streams.T

fig,ax = plt.subplots(figsize=(8,8))
ax.set_xlim(-np.max(data)*.1,np.max(data)*1.1)
ax.set_ylim(-np.max(data)*.1,np.max(data)*1.1)

colors = sns.color_palette(n_colors=num_clusters)
lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', color = colors, scale = 1, alpha = 0.8, zorder=100)
lc2 = ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = .6, alpha = 0.2)

start_colors = [colors[ind] for ind in cluster_labels]
ax.scatter(starting_points[:,0],starting_points[:,1],c=start_colors)

plt.show()

A sample output is

as you can see in the figure, vectors with close starting points are clustered into the same group.

Isn't `e1@e2` just `e1[:, 0]`? If that is the case, then return value can be simplified as `np.stack((e1_norm,e1),axis=1)` — fdermishin, Nov 26 '20 at 05:57
@user13044086 You are correct, but using `e1@e2` reminds me they are projections onto x and y axis. — meTchaikovsky, Nov 26 '20 at 06:03
Thanks @meTchaikovsky. I'm not overly concerned with the angle. If I had to rank feature importance, it would be length, starting_x, and then direction/angle. The `norm_angle` function creates `nan's`, which throws and error: `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` I'll make up some mock data — jonboy, Nov 26 '20 at 06:24
@jonboy You're welcome. A figure won't be of too much help ... Anyway, I updated my post, in which I also take into account the starting points for clustering. You can test this script on your actual dataset. — meTchaikovsky, Nov 26 '20 at 07:37

Cluster groups by direction and magnitude - Python

2 Answers2