What I am trying to do is use KMeans from scikit-learn to cluster plain-text documents into two categories.
Here is the use case: I will receive a few sample sets, some of which are going to be tagged as "Important" and some as "Un-Important".
In the scikit-learn examples, the data set comes in a predefined format from the newsgroups loader:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)
What I want to do instead is read the data from text files (the 20newsgroups download does not seem to be plain text at all; I cannot even unzip it).
What I am not clear about is the data structure that fetch_20newsgroups returns and how it works, and what I should do to convert my text files into the required format (like the one provided by fetch_20newsgroups).
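To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the text files are arranged in one sub-folder per tag and uses `sklearn.datasets.load_files`, which returns an object with the same kind of attributes as the one from fetch_20newsgroups (`data`, `target`, `target_names`, `filenames`). The folder names, the sample sentences, and the temporary directory are all made up for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.cluster import KMeans
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, grouped by tag (illustration only).
samples = {
    "important": [
        "server outage needs urgent fix",
        "urgent security patch for server",
    ],
    "unimportant": [
        "lunch menu for friday",
        "friday office party and lunch",
    ],
}

# Write the samples to disk as plain text files: one sub-folder per tag.
root = Path(tempfile.mkdtemp())
for tag, texts in samples.items():
    (root / tag).mkdir()
    for i, text in enumerate(texts):
        (root / tag / f"doc{i}.txt").write_text(text, encoding="utf-8")

# load_files reads the folder tree; with encoding set, dataset.data is a
# list of strings, dataset.target holds integer labels, and
# dataset.target_names holds the sub-folder names in sorted order.
dataset = load_files(str(root), encoding="utf-8", shuffle=True, random_state=42)

# Turn the raw text into TF-IDF vectors and cluster into two groups.
X = TfidfVectorizer().fit_transform(dataset.data)
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(dataset.target_names)
print(list(zip(dataset.filenames, labels)))
```

Note that KMeans is unsupervised, so the cluster numbers it assigns are arbitrary and do not automatically line up with the "Important" and "Un-Important" tags; the tags in `dataset.target` would only be useful for checking the result afterwards.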
Thanks
Phyo.