I am working on a homework assignment that involves clustering and classification, and I need some help because I am stuck.
I have a file with around 10,000 lines, each containing a random sentence, such as:
he likes computer science jobs
he has worked in the medical industry before
she likes to play with kids
he has had 5 years experience in computer science field.
I need to build multiple clusters out of all the input sentences and then put each sentence into a cluster.
For example:
COMPUTER SCIENCE: he likes computer science jobs
COMPUTER SCIENCE: he has had 5 years experience in computer science field.
KIDS: she likes to play with kids
MEDICAL: he has worked in the medical industry before
The clusters don't need to be called Computer Science, Kids, Medical, etc.; they will just have number assignments.
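To make the target output concrete, here is a minimal sketch of the shape I am after: each sentence gets a cluster number, and sentences are grouped by that number. The `labels` list is assigned by hand purely for illustration; producing it automatically is the part I am stuck on.

```python
from collections import defaultdict

sentences = [
    "he likes computer science jobs",
    "he has worked in the medical industry before",
    "she likes to play with kids",
    "he has had 5 years experience in computer science field.",
]

# Hand-assigned cluster numbers (an assumption for illustration only);
# a clustering algorithm would be expected to produce these.
labels = [0, 1, 2, 0]

# Group sentences by their cluster number.
clusters = defaultdict(list)
for label, sentence in zip(labels, sentences):
    clusters[label].append(sentence)

for label in sorted(clusters):
    for sentence in clusters[label]:
        print(f"{label}: {sentence}")
```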
What I have done:
- Read the file and cleaned each line by removing stop words, lowercasing the entire sentence, removing punctuation and other non-alphanumeric characters, and stemming the words with the Porter stemmer.
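The cleaning step looks roughly like the sketch below. The stop-word list here is a tiny made-up sample, and the `stem` parameter is a hypothetical hook that defaults to doing nothing; in practice something like `nltk.stem.PorterStemmer().stem` could be passed in. Lowercasing happens first so that the stop-word check is case-insensitive.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"he", "she", "has", "had", "in", "the", "to", "with", "a", "an", "of"}


def clean_sentence(sentence, stem=lambda word: word):
    """Lowercase, strip non-alphanumerics, drop stop words, then stem each word."""
    lowered = sentence.lower()
    # Replace every run of characters outside a-z and 0-9 with a single space.
    alnum_only = re.sub(r"[^a-z0-9]+", " ", lowered)
    words = [w for w in alnum_only.split() if w not in STOP_WORDS]
    # `stem` is the identity by default; pass a Porter stemmer's stem method to stem.
    return " ".join(stem(w) for w in words)


print(clean_sentence("He has had 5 years experience in computer science field."))
```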
Currently I have two things:
- A dict in the format ID (0-10000): CLEAN SENTENCE
- A dict in the format WORD: COUNT, with one entry for each unique word (after cleaning and stemming) across all 10,000 sentences
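For reference, the two structures can be built roughly as follows. This is a sketch only: `clean_lines` holds three made-up, already-cleaned sentences standing in for the real file, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up, already-cleaned and stemmed sentences standing in for the real file.
clean_lines = [
    "like comput scienc job",
    "work medic industri",
    "like play kid",
]

# ID -> clean sentence, with IDs assigned in file order starting at 0.
sentences_by_id = dict(enumerate(clean_lines))

# word -> total number of occurrences across all clean sentences.
word_counts = Counter(word for line in clean_lines for word in line.split())

print(sentences_by_id)
print(dict(word_counts))
```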
What would be my next step? Is this the point where I implement KNN, KMeans, etc.?