New to scikit-learn and KMeans: how do I cluster documents (from files) using k-means?

Question

What I am trying to do is use KMeans from scikit-learn to cluster plain-text documents into two categories.

Here is the use-case scenario: I will receive a few sample sets, some of which are going to be tagged as "Important" and some as "Un-Important".

In the scikit-learn examples, the data set comes in a predefined format from the 20 newsgroups loader:

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

What I want to do is read the data from text files (20newsgroups does not seem to be a text file at all; I cannot even unzip it).

What I am not clear about is the data structure returned by fetch_20newsgroups and how it works, and what I should do to convert text files into the required format (such as the one provided by fetch_20newsgroups).

Thanks

Phyo.

score 5 · Accepted Answer · answered Oct 08 '12 at 15:51

The 20 newsgroups dataset loader shipped with scikit-learn downloads the archive of text documents from the original site at http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html and then caches it in a compressed format in the $HOME/scikit_learn_data folder. Have a look at the source code of the 20 newsgroups dataset loader for more details.

To load your own set of text files as a scikit-learn "bunch" object you can use the sklearn.datasets.load_files function directly by pointing it to the right folder.
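As a rough illustration of the folder layout load_files expects (one subfolder per category, one text file per document), here is a minimal sketch. The sample sentences, the file names, and the temporary directory are made up for the example; in practice you would point load_files at your own folder instead.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical sample documents, keyed by category (subfolder) name.
samples = {
    "Important": [
        "Quarterly budget needs approval today.",
        "Server outage report for the board.",
    ],
    "Un-Important": [
        "Lunch menu for next week.",
        "Office party photos are online.",
    ],
}

with tempfile.TemporaryDirectory() as root:
    # Lay the files out as root/<category>/<file>.txt
    for category, texts in samples.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc{i}.txt").write_text(text, encoding="utf-8")

    # encoding="utf-8" makes load_files return decoded str instead of bytes.
    # The file contents are read into memory here, so the bunch stays usable
    # after the temporary directory is removed.
    bunch = load_files(root, encoding="utf-8", shuffle=True, random_state=42)

print(bunch.target_names)  # category names, taken from the subfolder names
print(len(bunch.data))     # number of documents loaded
print(bunch.target)        # one integer label per document (index into target_names)
```

The resulting bunch has the same main attributes as the one returned by fetch_20newsgroups (data, target, target_names), so code written against the newsgroups example should mostly work with it.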

If your data is already categorized into 2 categories (e.g. two subfolders named "Important" and "Un-Important"), then what you need is not a clustering algorithm, which is unsupervised, but a classifier such as MultinomialNB (Naive Bayes), LinearSVC (Linear Support Vector Machine) or LogisticRegression, which are supervised, as in the text classification example.
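To make the supervised route concrete, here is a small sketch that chains a TF-IDF vectorizer with MultinomialNB. The training sentences and labels are invented for illustration, and a pipeline is just one reasonable way to wire this up; with real data you would pass bunch.data and bunch.target from load_files and evaluate on held-out documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training documents.
train_docs = [
    "Quarterly budget needs approval today.",
    "Server outage report for the board.",
    "Lunch menu for next week.",
    "Office party photos are online.",
]
train_labels = ["Important", "Important", "Un-Important", "Un-Important"]

# Turn raw text into TF-IDF features, then fit a Naive Bayes classifier on them.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# Predict a label for previously unseen documents.
new_docs = ["Please approve the budget report.", "Photos from the party."]
predicted = clf.predict(new_docs)
for doc, label in zip(new_docs, predicted):
    print(f"{label!r}: {doc}")
```

Swapping MultinomialNB for LinearSVC or LogisticRegression only changes the last step of the pipeline.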

If you don't know which document belongs to which category but want to group your corpus into 2 groups of similar documents, then you can use an unsupervised clustering algorithm such as KMeans, but it's very unlikely that the 2 clusters you get will match your idea of "Important" and "Un-Important".
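For the unsupervised case, a minimal sketch of clustering with KMeans could look like this. The documents are made up, and the parameter choices (English stop words, n_init=10, a fixed random_state) are assumptions for the example rather than recommended settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabeled documents.
docs = [
    "Quarterly budget needs approval today.",
    "Budget report for the board meeting.",
    "Lunch menu for next week.",
    "Lunch and party photos are online.",
]

# Convert the raw text into a sparse TF-IDF matrix (one row per document).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Group the documents into 2 clusters of similar vocabulary.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)
```

The cluster ids 0 and 1 are arbitrary: they group documents by word similarity and carry no notion of importance, which is why the clusters may not line up with the two tags.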

Thank you very much , i may have more question for NB and LinearSVC after testing .load_files method. I will invite you there. — Phyo Arkar Lwin, Oct 08 '12 at 18:14
hey @ogrisel , can you answer my question here? http://stackoverflow.com/q/13068257/200044 i am planning to implement multiprocessing on scikit-learn — Phyo Arkar Lwin, Oct 25 '12 at 23:00
