
We are working on a text categorization task using an unsupervised machine learning model.

Before we do text clustering, the data set must go through several preprocessing steps, such as removing stop words, stemming the words, and then performing feature selection.
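For concreteness, here is a minimal sketch of the first two cleaning steps (stop word removal and stemming), assuming Apache Lucene is available; its EnglishAnalyzer bundles a standard English stop list and the Porter stemmer (the field name "body" and the sample sentence are just placeholders):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CleanDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer lower-cases, drops a standard English stop list,
        // and applies the Porter stemmer to each remaining token.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("body",
                     "The categories are clustered after removing the stop words")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // e.g. "categori", "cluster", ...
            }
            ts.end();
        }
    }
}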

Reading about feature selection, I have found several methods that I can apply, such as Information Gain, Gini Index, and Mutual Information.

I would like to know the nature of these methods and how I can implement them in code. Is there any library that I can use to perform these tasks?

S Gaber

3 Answers


You shouldn't select features.

Text follows a power law, so there are no "uncommon words" or unused features that you can safely skip: the information is hidden in the tail of the distribution, not among the most frequent words.

If you do want to bound dimensionality for computational efficiency (Reuters is considered small for a text corpus), you should deploy a hashing-based approach.
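For illustration, a minimal sketch of such a feature-hashing representation, with no library dependencies; the dimension 2^18 and the use of String.hashCode are arbitrary choices for the example:

import java.util.Arrays;

public class HashingVectorizer {
    private static final int DIM = 1 << 18; // 262,144 buckets: fixed, vocabulary-independent

    // Map each token to a bucket; hash collisions are tolerated by design.
    public static double[] vectorize(String[] tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            int idx = Math.floorMod(t.hashCode(), DIM);
            v[idx] += 1.0; // raw term count; TF-IDF weighting could be applied afterwards
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = vectorize("the cat sat on the mat".split(" "));
        System.out.println(Arrays.stream(v).sum()); // prints 6.0: every token hashed somewhere
    }
}

A production implementation would typically use a sparse vector and a seeded hash such as MurmurHash, often with a second hash to decide the sign of each update so that collisions cancel out in expectation.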

  • I suppose you go for some standard TF-IDF feature representation and treat words as features.
iliasfl
  • Wouldn't feature selection with TF-IDF keep such information and discard the most frequent words that provide little discrimination power? – Vanquish46 May 06 '14 at 20:54

Using feature selection can help text categorization, depending on the application domain. For topic (theme-based) categories such as Economy, Politics, or Sports, stemming, stoplisting, and selecting words and word n-grams usually work well. In other problems, like spam detection, keeping stop words in the representation can improve accuracy.

The question is: is the style of the text important in your application domain? If yes, you should keep stop words and avoid stemming, but you can still perform feature selection, e.g. keeping the features with the top Information Gain scores.
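For reference, the Information Gain of a term $t$ with respect to the class variable $C$ is usually defined as

$$IG(t) = H(C) - P(t)\,H(C \mid t) - P(\bar{t})\,H(C \mid \bar{t}),$$

where $H(C) = -\sum_{c} P(c) \log P(c)$ is the entropy of the class distribution and $\bar{t}$ denotes the absence of the term; terms are then ranked by this score.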

You can perform stoplisting and stemming in WEKA via the StringToWordVector filter, and feature selection via the AttributeSelection filter with the Ranker search method and the InfoGainAttributeEval evaluation metric. You can find more details on my page about Text Mining with WEKA (sorry for the shameless self-promotion).
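As a rough sketch of the preprocessing half of that pipeline through WEKA's Java API (the file name docs.arff, the parameter values, and the choice of LovinsStemmer and the Rainbow stop list are placeholders; setStopwordsHandler is the setter in recent WEKA versions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.LovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        // docs.arff: one string attribute holding the raw text, plus a class attribute
        Instances raw = DataSource.read("docs.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setStopwordsHandler(new Rainbow());   // stoplisting
        filter.setStemmer(new LovinsStemmer());      // stemming
        filter.setTFTransform(true);                 // TF-IDF weighting
        filter.setIDFTransform(true);
        filter.setWordsToKeep(5000);                 // cap the vocabulary size
        filter.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " features");
    }
}

The resulting Instances can then be passed to the AttributeSelection filter for Information Gain ranking.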


First, we have to generate an ARFF file.

The ARFF file format is shown below:

The header will list every word present in your whole corpus after preprocessing, one @ATTRIBUTE line per word. Each word attribute is of type real because a TF-IDF value is a real number.

The @data section will contain the TF-IDF values calculated during preprocessing: each row holds the TF-IDF value of every word for one document, with the document category in the last column.

@RELATION filename
@ATTRIBUTE word1 real
@ATTRIBUTE word2 real
@ATTRIBUTE word3 real
% ... one @ATTRIBUTE line per word, and so on
@ATTRIBUTE class {cacm,cisi,cran,med}

@data
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.55454479562,0.1619617,0.579562,0.5542,cisi
0.5545479562,0.27,0.554544479562,0.4479562,cisi
0.0,0.2396113617,0.44479562,0.2,cran
0.5545479562,0.27,0.554544479562,0.4479562,cran
0.5545177444479562,0.26196113617,0.0,0.0,med
0.5545479562,0.27,0.554544479562,0.4479562,med

After you generate this file, you can give it as input to InfoGainAttributeEval, and this is working for me.
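For completeness, a minimal sketch of feeding such a file to the evaluator through WEKA's Java API rather than a standalone InfoGainAttributeEval.java (the file name corpus.arff and the cutoff of 1000 are placeholders):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("corpus.arff"); // the file generated above
        data.setClassIndex(data.numAttributes() - 1);    // class is the last attribute

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(1000); // keep the 1000 highest-scoring words
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Print the retained attributes (the class index appears at the end).
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}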

Ashish