
Hello, I am a machine learning newbie and I need some help with unsupervised clustering of high-dimensional data. I have around 50–80 thousand rows spanning 15 participants (a roughly equal number of rows per participant), each row timestamped and carrying 15 features. The data looks something like this:

```
Participant   time   feature 1   feature 2   ...
1             0.05   val         val
1             0.10   val         val
2             0.05   val         val
2             0.10   val         val
2             0.15   val         val
```

The data consists of many participants; each participant has multiple timestamped rows of features. My goal is to cluster the data by participant and make inferences based on those clusters. The problem is that each participant has many rows, so I cannot represent a participant with a single point, which makes clustering them seem difficult.

I would like help with:

  1. What would be the best way to cluster this data so that I can make inferences per participant?

  2. Which clustering technique should I use? I have tried scikit-learn's KMeans, MeanShift and other methods, but they take too long and crash my system.

Sorry if it's a bit difficult to understand; I will try my best to answer your questions. Thank you in advance for the help. If this question is very similar to some other question, please let me know (I was not able to find it).

Thank you :)

1 Answer


Since you are running out of compute, you will have to make some sort of compromise. Here are a few suggestions that will likely fix your problem, but each comes at a cost.

  1. Dimensionality reduction, e.g. PCA, to reduce your number of columns to ~2 or so. You will lose some information, but you will be able to plot the result and do inference via K-means.

  2. Average each participant's data. I am not sure this alone will be enough; it depends on your data. You lose the over-time view of each participant, but it drastically reduces the number of rows. A sketch combining both ideas follows this list.
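
A minimal sketch combining both suggestions, assuming the data sits in a pandas DataFrame shaped like the table in the question (the column names, the synthetic values, and k=3 are placeholders, not taken from the question):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for the table in the question: 15 participants,
# a few timestamped rows each, 15 features per row.
rows = []
for participant in range(1, 16):
    for t in range(10):
        rows.append([participant, 0.05 * (t + 1), *rng.normal(size=15)])
feature_cols = [f"feature{i}" for i in range(1, 16)]
df = pd.DataFrame(rows, columns=["Participant", "time", *feature_cols])

# Suggestion 2: collapse each participant's rows into one vector
# (mean of each feature over time; std etc. could be appended too).
per_participant = df.groupby("Participant")[feature_cols].mean()

# Suggestion 1: standardize, then PCA down to 2 components for plotting.
X = StandardScaler().fit_transform(per_participant.values)
X2 = PCA(n_components=2).fit_transform(X)

# One point per participant now, so K-means is cheap; k=3 is arbitrary.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
print(dict(zip(per_participant.index, labels)))
```

With one point per participant, the clustering step itself becomes trivial in cost; the expensive part (clustering 50–80 thousand raw rows) disappears entirely.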

My suggestion is to do dimensionality reduction, since losing the over-time data on your participants might render the data useless. There are also alternatives to PCA, for example autoencoders. For clustering the way you describe, I'd recommend you stick to K-means or soft K-means.
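
scikit-learn ships no soft K-means as such; a `GaussianMixture` is the closest off-the-shelf substitute, since `predict_proba` returns soft cluster memberships. A short sketch, reusing `X2` from the snippet above:

```python
from sklearn.mixture import GaussianMixture

# Soft clustering stand-in for "soft K-means": each participant gets a
# probability of belonging to each cluster instead of a hard label.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X2)
membership = gmm.predict_proba(X2)  # shape: (n_participants, 3)
print(membership.round(2))
```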

tnfru
  • Thank you for your answer; the first option seems like the way to proceed. I will continue as suggested and post updates as soon as possible. – Sidharth Kaliappan Aug 23 '21 at 11:34
  • Glad if it helps. I've also figured you may want to group the data per participant by concatenating the respective rows into a matrix and then use autoencoders to reduce dimensions. This can definitely yield better results than PCA. If you choose to use PCA, you might want to think about what percentage of variance you want to retain. [This thread](https://stackoverflow.com/questions/33509074/sklearn-pca-calculate-of-variance-retained-for-choosing-k) offers an explanation of how to use sklearn to determine the number of features given the variance to retain. – tnfru Aug 23 '21 at 12:20
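
For the variance-retention idea in the last comment: passing a float between 0 and 1 as `n_components` makes scikit-learn's `PCA` keep just enough components to explain that fraction of the variance. A one-step sketch, reusing the standardized matrix `X` from the earlier snippet:

```python
from sklearn.decomposition import PCA

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches at least that fraction.
pca = PCA(n_components=0.95)  # retain ~95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```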