Feature Selection in dataset containing both string and numerical values?

Question

Hi, I have a big dataset that contains both string and numerical values, e.g.:

User name (str), handset (str), number of requests (int), number of downloads (int), ...

I have around 200 such columns.

Is there a way or an algorithm that can handle both strings and integers during feature selection? If not, how should I approach this issue?

Thanks

Your question is way too broad. What have you tried? What do you need to do with the data? — ecline6, Apr 07 '13 at 21:50
This is not a package-specific question, but it would be great to know which packages are helpful in this case. I have data as described above, each column being a feature (200 features in total), of types integer and string. I want to find out which features contribute towards "download" (boolean 0/1), so I want to select only those features that affect "download". I guess most feature selection algorithms take only real numbers as input. — cryp, Apr 08 '13 at 01:47

score 0 · Answer 1 · answered Apr 08 '13 at 19:18

Feature selection algorithms assign weights to features based on their impact on the classification. To the best of my knowledge, the feature type does not make a difference when computing these weights. I suggest converting the string features to numerical values, based on their ASCII codes or any other technique. You can then use the existing feature selection algorithms in RapidMiner.
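As a rough illustration of one such "other technique", here is a minimal Python sketch of integer (label) encoding, where each distinct string gets a small integer code. The function name `encode_column` and the sample handset values are made up for the example. Note that the codes are arbitrary, so they are probably better suited to category-based measures than to measures that assume an ordering.

```python
def encode_column(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused code
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical sample of the "handset" column.
handsets = ["nokia", "iphone", "nokia", "samsung", "iphone"]
encoded, mapping = encode_column(handsets)
print(encoded)
print(mapping)
```

Keeping the `mapping` dictionary around lets you translate the codes back to the original strings later.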

score 0 · Answer 2 · answered Apr 14 '13 at 19:55

There is a set of operators in the Attribute Weighting group within RapidMiner that you could use, for example Weight by Correlation or Weight by Information Gain.

These will assess how much weight to give an attribute based on its relevance to the label (in this case the download flag). The resulting weights can then be used with the Select by Weights operator to eliminate those that are not needed. This approach considers attributes by themselves.
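To make the idea concrete outside RapidMiner, here is a small Python sketch of weighting attributes by information gain and then keeping those above a threshold. It is not RapidMiner's implementation; the function names, the toy columns and the threshold of 0.5 are assumptions for illustration. It treats every column as categorical, so string columns work directly, while numeric columns with many distinct values would need binning first.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on column."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_by_weights(weights, threshold):
    """Keep the attribute names whose weight reaches the threshold."""
    return [name for name, weight in weights.items() if weight >= threshold]


# Hypothetical toy data: "handset" determines the label, "region" does not.
data = {
    "handset": ["nokia", "nokia", "iphone", "iphone"],
    "region": ["eu", "us", "eu", "us"],
}
download = [0, 0, 1, 1]

weights = {name: information_gain(col, download) for name, col in data.items()}
selected = select_by_weights(weights, threshold=0.5)
print(weights, selected)
```

Each attribute is scored independently here, so two attributes that are only useful together would both get a low weight.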

You could also build a classification model and use the forward selection operators to add more and more attributes and monitor performance. This approach will consider the relationships between attributes.
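A bare-bones Python sketch of greedy forward selection is shown below. It is only meant to show the loop, not what RapidMiner does internally: the "model" is a majority-vote lookup table, the score is training accuracy (in practice you would want cross-validation), and the column names and data are invented.

```python
from collections import Counter, defaultdict


def accuracy(rows, labels, features):
    """Training accuracy of a majority-vote lookup table built on `features`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)].append(label)
    # In each group, the majority label is predicted for every row.
    correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return correct / len(labels)


def forward_selection(rows, labels, candidates, min_gain=1e-9):
    """Add the best remaining feature each round until the score stops improving."""
    selected = []
    best = accuracy(rows, labels, selected)  # baseline: always predict the majority
    remaining = list(candidates)
    while remaining:
        scored = [(accuracy(rows, labels, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best < min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical data: download = many_requests OR smartphone; row_parity is noise.
columns = ["many_requests", "smartphone", "row_parity"]
raw = [(0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 1, 1),
       (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
rows = [dict(zip(columns, values)) for values in raw]
download = [0, 0, 0, 1, 1, 1, 1, 1]

selected, score = forward_selection(rows, download, columns)
print(selected, score)
```

Because each candidate is scored together with the attributes already chosen, the second attribute is picked for what it adds on top of the first, which is the difference from the per-attribute weighting approach.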

score 0 · Answer 3 · answered Jul 29 '13 at 09:20

I've used Weka's feature selection. The attribute evaluator methods I've tried can't handle string attributes, but you can temporarily remove them with Preprocess > Filter > Unsupervised > Attribute > RemoveType, perform the feature selection, and then add the string attributes back in for the classification.
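The same workflow can be sketched in plain Python: set the string columns aside, select among the numeric ones, then put the strings back. This is not Weka's API. The helper names, the toy table, the use of absolute Pearson correlation as the evaluator and the 0.5 cut-off are all assumptions for illustration.

```python
import math


def split_by_type(table):
    """Separate string columns from the rest (rough analogue of removing a type)."""
    string_cols = {n: c for n, c in table.items() if all(isinstance(v, str) for v in c)}
    numeric_cols = {n: c for n, c in table.items() if n not in string_cols}
    return string_cols, numeric_cols


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical table with two string and two numeric columns.
table = {
    "user_name": ["ann", "bob", "cat", "dan"],
    "handset": ["nokia", "nokia", "iphone", "iphone"],
    "requests": [1, 2, 8, 9],
    "session_id": [4, 1, 3, 2],
}
download = [0, 0, 1, 1]

string_cols, numeric_cols = split_by_type(table)
kept_numeric = {
    name: col for name, col in numeric_cols.items()
    if abs(pearson(col, download)) >= 0.5
}
# Put the untouched string columns back next to the selected numeric ones.
final_table = {**string_cols, **kept_numeric}
print(list(final_table))
```

The string columns pass through unevaluated, so this only prunes the numeric attributes.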
