0

I have a set of 2000 points, which are the x,y coordinates of pass origins from association football. I want to run a k-means clustering algorithm on them (k=10) to find which 10 passes are the most common. However, I don't want to predict anything for future points; I simply want to work with the existing data. Do I still need to split it into training and testing sets? I assume that is only done when we want to train a model on one set and then use it to predict future values. I'm new to clustering (and to Python as a whole), so any help would be appreciated.

Abhishek

2 Answers

-1

No. In clustering (i.e. unsupervised learning) you do not need to split the data.
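
As a rough illustration of fitting on the whole dataset, here is a minimal sketch using scikit-learn's `KMeans`. The real pass coordinates are not shown in the question, so the `passes` array below is random placeholder data, and the 105 x 68 pitch size is an assumption made only to generate it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data (assumption): 2000 random (x, y) pass origins on a
# 105 x 68 pitch. Replace this with the real coordinates.
rng = np.random.default_rng(0)
passes = rng.uniform(low=[0.0, 0.0], high=[105.0, 68.0], size=(2000, 2))

# Fit k-means with k=10 on ALL points; there is no train/test split.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(passes)

# One centre per cluster, plus how many passes fall into each cluster.
centres = kmeans.cluster_centers_
counts = np.bincount(labels, minlength=10)

# List clusters from most to least populated.
for idx in np.argsort(counts)[::-1]:
    x, y = centres[idx]
    print(f"cluster {idx}: centre=({x:.1f}, {y:.1f}), passes={counts[idx]}")
```

The cluster centres and their counts describe the existing data directly, which is all the question asks for.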

  • Could you explain why? I have talked to people much smarter than me and they all say that the testing/training bit is a must, but it doesn't seem so obvious to me since I am new to this. I'm mostly after the 'why' and 'how' behind what exactly happens. Mostly the 'why'. – Abhishek Apr 01 '19 at 08:44
  • Just ask yourself: what type of algorithm is clustering? It's unsupervised, so what is the point of splitting the data? You cannot use test data to validate the model, since you do not have target labels. This is different in supervised learning, where you have samples with labels. – Kiruparan Balachandran Apr 01 '19 at 08:50
  • Thanks a lot. I guess I get it now – Abhishek Apr 01 '19 at 09:39
-1

I disagree with the answer. Clustering has accuracy as a metric. If you do not split the data into train and test sets, then most likely you'll be overfitting the model. See these similar questions: 1, 2, 3. Please note that splitting data into train/test sets is unrelated to whether the problem is supervised or unsupervised.

mnm
  • Could you explain how "splitting into test/train set is unrelated to supervised or unsupervised problem"? I thought you could only train the model in supervised learning. Is that not accurate? – Abhishek Apr 07 '19 at 05:15
  • @Abhishek it seems you have your fundamentals incorrect. Your question is about model building. A model can be built using a supervised or an unsupervised method. In building a model, you have to ensure the model works properly. So if you don't split the data, and you train the model on the whole dataset and test it again on the same dataset, then your model will be overfitting (because it has already seen the full data). Read [cluster evaluation metrics](https://scikit-learn.org/stable/modules/clustering.html). The point is that, whether it is supervised or unsupervised, you must evaluate the model. – mnm Apr 07 '19 at 05:29
  • The accuracy of a clustering algorithm is measured based on distances (inter-cluster and intra-cluster distance). I would like to see a working example from you: how can splitting help to measure the accuracy? – Kiruparan Balachandran Apr 08 '19 at 06:02
  • @kiruparan-balachandran the world wide web is your friend! There are numerous examples littered around. Make an effort to find it out and study it. If you can't then at least read my previous comment on **cluster evaluation metric**! – mnm Apr 08 '19 at 06:37
  • Yes, I did. Please be kind enough to read each evaluation metric and understand how it works. For your reference, you will find this sentence for most evaluation metrics: "These metrics require the knowledge of the ground truth classes while almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting)". – Kiruparan Balachandran Apr 08 '19 at 09:06
  • Follow-up to my previous comment: some of them use the model itself to evaluate (not a split), for example the Silhouette Coefficient. That is why I kindly requested a working example where splitting is used for evaluation. Thanks – Kiruparan Balachandran Apr 08 '19 at 09:06
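
To make the Silhouette Coefficient mentioned in the last comment concrete, here is a minimal sketch of an internal evaluation that needs neither ground-truth labels nor a held-out set. It uses scikit-learn's `silhouette_score`; the `points` array is random placeholder data on an assumed 105 x 68 pitch, and the candidate values of k are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder data (assumption): 2000 random (x, y) points standing in
# for the real pass origins.
rng = np.random.default_rng(0)
points = rng.uniform(low=[0.0, 0.0], high=[105.0, 68.0], size=(2000, 2))

# Score a few candidate values of k on the SAME data the model is fitted on.
# The silhouette compares each point's mean distance to its own cluster
# with its mean distance to the nearest other cluster, so it needs no labels.
scores = {}
for k in (5, 10, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

Scores lie between -1 and 1, with higher values indicating tighter, better-separated clusters. On uniformly random placeholder data the values mean little; the sketch only shows the mechanics.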