Clustering discrete distributions in Python?

Question

My dataset consists of music streaming records from users. I have around 100 different music genres, and I would like to cluster them according to the age distribution of their listeners.

To be clearer, user ages are divided into "blocks" (A1: 0-10 years; A2: 11-20 years; ...; A6: 61+), so an example of the data I would like to cluster is the following:
Pop: 0.05 A1; 0.3 A2; 0.35 A3; 0.2 A4; 0.05 A5; 0.05 A6
Rock: 0.05 A1; 0.2 A2; 0.2 A3; 0.1 A4; 0.15 A5; 0.1 A6

I would like to obtain clusters of genres with similar distributions. How can I do this in Python? Can I just treat each genre as a data point in a 6-dimensional space, or should I use something more refined? For example, can I use a customized distance for distributions in a clustering algorithm?

Thank you

See https://stackoverflow.com/questions/33721996/how-to-specify-a-distance-function-for-clustering using e.g. the RMS, see https://en.wikipedia.org/wiki/Root-mean-square_deviation — Carlos Horn, Jul 04 '22 at 10:48

score 0 · Answer 1 · answered Jul 04 '22 at 12:32

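If you have prior knowledge that lets you design your own distance function, the algorithms in scipy.cluster.hierarchy should support that: linkage accepts a precomputed condensed distance matrix, so you can compute the pairwise distances with any metric you like. Note that the centroid, median, and ward methods are only correctly defined for Euclidean distances, so with a custom distance stick to methods such as single, complete, or average.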
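As a minimal sketch of the custom-distance route: the snippet below uses the Jensen-Shannon distance from scipy.spatial.distance as one possible choice of distance between distributions (my pick for illustration, not something your data dictates) and feeds it to average-linkage hierarchical clustering. The four rows of X are made-up example distributions over A1..A6, and the choice of 2 clusters is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import jensenshannon, pdist

# Hypothetical data: one row per genre, one column per age block A1..A6.
# Each row is a discrete distribution (sums to 1).
X = np.array([
    [0.05, 0.30, 0.35, 0.20, 0.05, 0.05],  # skewed towards younger listeners
    [0.05, 0.35, 0.30, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.30, 0.30],  # skewed towards older listeners
    [0.05, 0.05, 0.15, 0.20, 0.25, 0.30],
])

# Condensed pairwise distance matrix using a callable as the metric.
# Any function taking two 1-D arrays and returning a float works here.
distances = pdist(X, metric=jensenshannon)

# Average linkage does not assume Euclidean distances.
Z = linkage(distances, method="average")

# Cut the dendrogram into (at most) 2 flat clusters; 2 is an assumption.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

To try another distance, swap jensenshannon for your own function with the same two-array signature.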

My opinion: given the problem statement, you should be fine with classic clustering methods. If you treat each genre as a point in 6-dimensional space, at least one of them (KMeans, spectral clustering, DBSCAN, ... with proper parameters) should do the trick.
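For the plain approach, here is a rough sketch with scikit-learn's KMeans, treating each genre as a 6-dimensional vector. The data is made up for illustration, and n_clusters=2 is an assumption you would need to tune for your ~100 genres (for example with silhouette scores).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: one row per genre, one column per age block A1..A6.
X = np.array([
    [0.05, 0.30, 0.35, 0.20, 0.05, 0.05],  # skewed towards younger listeners
    [0.05, 0.35, 0.30, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.30, 0.30],  # skewed towards older listeners
    [0.05, 0.05, 0.15, 0.20, 0.25, 0.30],
])

# KMeans uses Euclidean distance between the 6-dimensional vectors.
# n_clusters=2 is an assumption; random_state makes the run reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

Keep in mind that KMeans only supports Euclidean distance, so this ignores the fact that the rows are distributions; if that matters, a custom distance with hierarchical clustering is the more refined option.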
