I am working on a homework assignment that involves Clustering and Classification and need some help as I am stuck.

I have a file with around 10000 lines, each containing a random sentence such as:

he likes computer science jobs

he has worked in the medical industry before

she likes to play with kids

he has had 5 years experience in computer science field.

I need to build multiple clusters out of all the input sentences and then put each sentence into a cluster.

For Example:

COMPUTER SCIENCE: he likes computer science jobs
COMPUTER SCIENCE: he has had 5 years experience in computer science field.
KIDS: she likes to play with kids
MEDICAL: he has worked in the medical industry before

The clusters don't need to be called Computer Science, Kids, Medical, etc.; they will just have number assignments.

What I Have Done:

  • Read the file and cleaned each line by removing stop words, lowercasing the entire sentence, removing punctuation and other non-alphanumeric characters, and stemming the words with the Porter stemmer.

Currently I have two things:

  • a DICT in the format of ID(0-10000): CLEAN SENTENCE

  • a DICT in the format of WORD: COUNT, with one entry for each unique stemmed and cleaned word across all 10000 sentences.

What would be my next step? Is this when I implement KNN or KMeans etc?

jonrsharpe
Andy P
  • in your problem, `COMPUTER SCIENCE` and `KIDS` are called features, not necessarily clusters. Clusters are regions in feature space where you have many entries, e.g. lots of records with `KIDS` **and** `COMPUTER SCIENCE` – Andre Holzner Nov 30 '14 at 08:01
  • @andreholzner I have thousands of records that may fit into computer science, that was just a small sample – Andy P Nov 30 '14 at 08:02
  • yes, the emphasis in my comment was not on 'many' but on the fact that clusters are usually characterized by multiple features (e.g. discover that the majority of records are tagged with `KIDS` **and** `COMPUTER SCIENCE`). – Andre Holzner Nov 30 '14 at 08:09
  • What you implement is directed by what you want your application to do. For example, do you already know how many clusters there shall be? Can a sentence belong to only one cluster (e.g. "Her younger kid wants to study computer science while the older one is studying medicine.")? If you know the clusters beforehand, and each sentence can only have one cluster, then you are better off building a classifier. – Chthonic Project Nov 30 '14 at 08:14
  • @AndyP, please clarify which question should be answered using the data, `assignment that involves Clustering and Classification` is quite vague. – Andre Holzner Nov 30 '14 at 08:23
  • If you wanted to use a supervised approach (and you don't have a training set to 'teach' a classifier) you could manually label about 100 sentences, including multiple samples of all classes, then use something like Google Prediction API (https://cloud.google.com/prediction/docs/hello_world) to make predictions. Also, the NLTK package looks like it has a sentence classifier: http://www.nltk.org/book/ch06.html In either case you'll need a training set. – Ryan Nov 30 '14 at 09:03

1 Answer

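Your next step is to cluster the cleaned text, treating each cleaned sentence as one data point. k-means needs numeric input, so first convert each sentence to a vector (for example, word counts or TF-IDF weights over your vocabulary). You can then use k-means from any of the Python data mining libraries to get the clusters.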
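As a rough illustration of this step, here is a minimal standard-library sketch. Several things in it are assumptions rather than part of the answer: the TF-IDF weighting (the answer does not name a representation), the helper names `tfidf_vectors` and `kmeans`, the deterministic farthest-point initialisation, and the four stemmed sentences, which only approximate what Porter stemming would give for the question's samples.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Turn cleaned sentences into L2-normalised TF-IDF vectors (lists of floats)."""
    docs = [s.split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    n = len(docs)
    vectors = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for w, tf in Counter(d).items():
            # term frequency times a smoothed inverse document frequency
            vec[index[w]] = tf * (math.log((1 + n) / (1 + df[w])) + 1.0)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors, vocab

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; initial centroids are chosen by a farthest-point rule."""
    centroids = [points[0]]
    while len(centroids) < k:
        # next seed: the point farthest from its nearest existing centroid
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Illustrative ID -> cleaned sentence dict (stems are approximate).
clean = {
    0: "like comput scienc job",
    1: "work medic industri",
    2: "like play kid",
    3: "5 year experi comput scienc field",
}
ids = sorted(clean)
vectors, vocab = tfidf_vectors([clean[i] for i in ids])
labels, centroids = kmeans(vectors, k=3)
clusters = dict(zip(ids, labels))  # sentence ID -> cluster number
```

On real data a library such as scikit-learn (`TfidfVectorizer`, `KMeans`) would normally replace both helper functions; the hand-written versions are here only to show what happens at each step.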

======== clustering=========

How do you decide K in k-means (i.e. the number of clusters)? You can 1) plot the k-means objective against K and pick the K at the knee of the curve, 2) use the Bayesian information criterion, or 3) use another popular method that suits your particular dataset. If you don't know about these, read up on them here: How do I determine k when using k-means clustering?
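Method 1 can be sketched as follows, using toy 2-D points instead of sentence vectors so the result is easy to check by eye. The objective here is the within-cluster sum of squared distances. Picking the knee as the K with the largest second difference of that curve is one simple heuristic and an assumption of this sketch, as are the helper names and the data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Compact Lloyd's algorithm with farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def objective(points, labels, centroids):
    """k-means objective: sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Toy data with three visually obvious groups (made up for illustration).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

ks = list(range(1, 6))
sse = [objective(points, *kmeans(points, k)) for k in ks]

# Second difference of the curve; the sharpest bend is taken as the knee.
curvature = {ks[i]: sse[i - 1] - 2 * sse[i] + sse[i + 1] for i in range(1, len(ks) - 1)}
best_k = max(curvature, key=curvature.get)
```

Plotting `sse` against `ks` shows a steep drop up to K = 3 and an almost flat curve afterwards. On real sentence data the bend is usually much less clear, which is a good reason to compare it with the other methods.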

Since this is homework, the learning experience counts for more, so you should try more than one of the above to get a "feel" for it.

At the end of this procedure you will have K clusters.

Now comes the classification part.

======== classification=========

Treat each of the K clusters as one class. There are several ways to classify each data point (i.e. cleaned sentence) into the K classes:

1. Treat whatever cluster a data point was assigned to at the end of k-means as its class.

2. Take each cluster centroid as the representative point of its class and use a similarity or distance measure (cosine similarity, KL divergence, etc.) to compare a given data point with each of the K representative points. Assign the data point to the class of its closest representative point.

Note that (1) above is the easiest.
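Option (2) could look roughly like the sketch below, which uses cosine similarity over sparse word-weight dicts. The centroid weights, the sentence vectors, and the names `cosine` and `classify` are all hypothetical; in practice the centroids would come from your k-means run.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(vector, centroids):
    """Return the class whose centroid is most similar to the vector."""
    return max(centroids, key=lambda c: cosine(vector, centroids[c]))

# Hypothetical centroids keyed by cluster number.
centroids = {
    0: {"comput": 0.6, "scienc": 0.6, "job": 0.3},
    1: {"medic": 0.7, "industri": 0.7},
    2: {"kid": 0.7, "play": 0.7},
}

# A new cleaned sentence, e.g. "studi comput scienc", as a word-count vector.
new_sentence = {"studi": 1.0, "comput": 1.0, "scienc": 1.0}
predicted = classify(new_sentence, centroids)
```

Unlike option (1), this also works for sentences that were not part of the original clustering run, since it only needs the centroids.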

========================================

There are various other methods for clustering (spherical k-means, agglomerative etc.) and that will change your classification step as well.

Abhimanu Kumar