3

I'm having difficulty mining a large dataset (100K entries) on logistics transportation. I have around 10 nominal String attributes (e.g. city/region/country names, customer/vessel identification codes). Along with those, I have one date attribute "departure" and one ratio-scaled numeric attribute "goal".

What I'm trying to do is use a training set to find out which attributes correlate strongly with "goal", and then validate these patterns by predicting the "goal" value of entries in a test set.

I assume clustering, classification and neural networks could be useful for this problem, so I used RapidMiner, KNIME and ELKI and tried to apply some of their tools to my data. However, most of these tools only handle numeric data, so I got no useful results.

Is it possible to transform my nominal attributes into numeric ones? Or do I need to find different algorithms that can actually handle nominal data?

Has QUIT--Anony-Mousse
hildebro
    If you want to predict a numeric attribute you're doing regression (not classification) and some regression *algorithms*, in any tool, may need nominal inputs converted to numeric. I'm not clear exactly what you want to achieve though - is it to *identify the attributes that correlate with your target attribute*, or is it to *build a model that can predict the target attribute given the other attributes*? Many ML algorithms can give good predictions, if a relationship is actually present in your data, but not all of them will tell you how influential each attribute was on the prediction. – nekomatic Jun 20 '18 at 08:33

2 Answers

4

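You most likely want to use a tree-based algorithm. These handle nominal features well. Be aware that you do not want to use "ID-like" attributes, i.e. attributes with a nearly unique value per entry, since a model can memorize them without learning anything that generalizes.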
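As a rough way to spot "ID-like" attributes before modelling, you could compare the number of distinct values in each column to the number of rows. The sketch below is only an illustration: the function name `id_like_columns`, the 0.9 threshold and the sample column names are my own assumptions, not something RapidMiner provides.

```python
def id_like_columns(rows, threshold=0.9):
    """Return names of columns whose share of distinct values exceeds threshold.

    rows is a list of dicts that all share the same keys.
    The threshold of 0.9 is an arbitrary assumption; tune it for your data.
    """
    if not rows:
        return []
    flagged = []
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) / len(rows) > threshold:
            flagged.append(name)
    return flagged


# Hypothetical sample data: shipment_id is unique per row, country is not.
rows = [
    {"shipment_id": "S001", "country": "DE"},
    {"shipment_id": "S002", "country": "DE"},
    {"shipment_id": "S003", "country": "NL"},
    {"shipment_id": "S004", "country": "DE"},
]
print(id_like_columns(rows))  # ['shipment_id']
```

Columns flagged this way are candidates to drop or to aggregate (for example, customer code to customer segment), but the decision still needs a look at what the attribute means.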

I would recommend RapidMiner's AutoModel feature as a start. GBT and RandomForest should work well.

Best, Martin

2

The handling of nominal attributes does not depend on the tool; it depends on the algorithm you use. For example, k-means with Euclidean distance can't handle string values. But other distance functions can handle them, and so can some algorithms, for example the random forest implementation in RapidMiner.
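To illustrate what a distance function for nominal values can look like, here is a minimal sketch of the simple matching (Hamming-style) distance: the fraction of attributes on which two records disagree. The choice of this particular distance and the name `matching_distance` are my assumptions; it is one common option, not necessarily what any given tool uses.

```python
def matching_distance(a, b):
    """Fraction of attributes on which two records disagree (0.0 = identical)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        return 0.0
    # True counts as 1, so the sum is the number of mismatching positions.
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical records: (city, country, vessel code)
r1 = ("Hamburg", "DE", "vessel_17")
r2 = ("Hamburg", "DE", "vessel_42")
r3 = ("Rotterdam", "NL", "vessel_42")

print(matching_distance(r1, r2))  # differs in 1 of 3 attributes
print(matching_distance(r1, r3))  # differs in all 3 attributes: 1.0
```

Note that k-means itself still needs a mean, which doesn't exist for strings, so a distance like this is normally paired with a mode- or medoid-based method (k-modes, k-medoids) or with hierarchical clustering.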

You can of course also transform the nominal attributes into numeric ones, for example by using a binary dummy encoding or by assigning a unique integer value to each category (which might introduce some bias, because the integers imply an order that the categories don't have). In RapidMiner you have the Nominal to Numerical operator for that.
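Outside of RapidMiner, the two encodings can be sketched in a few lines of plain Python. This is only meant to show the idea; the function names and the sample values are made up, and real libraries offer their own, more complete encoders.

```python
def integer_encode(values):
    """Map each category to an integer index (sorted alphabetically)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories


def dummy_encode(values):
    """Map each value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


countries = ["DE", "NL", "DE", "FR"]

codes, cats = integer_encode(countries)
print(cats)   # ['DE', 'FR', 'NL']
print(codes)  # [0, 2, 0, 1]

dummies, _ = dummy_encode(countries)
print(dummies)  # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

With the integer codes, a distance-based or linear model would treat "NL" (2) as twice as far from "DE" (0) as "FR" (1), which is the bias mentioned above. The dummy encoding avoids that at the cost of one column per category.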

Depending on the distribution of your nominal values, it might also be useful to handle rare values. You could either group them together in a new category (such as "other") or use a feature selection algorithm after you apply the dummy encoding.
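The grouping of rare values can be sketched as follows. This is not the Replace Rare Values operator itself, just a plain-Python illustration of the idea; `replace_rare`, the `min_count` cutoff and the sample ports are assumptions of mine.

```python
from collections import Counter


def replace_rare(values, min_count=2, other="other"):
    """Replace every value that occurs fewer than min_count times with other."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical column: Oslo and Riga each occur only once.
ports = ["Hamburg", "Hamburg", "Rotterdam", "Oslo", "Hamburg", "Rotterdam", "Riga"]
print(replace_rare(ports))
# ['Hamburg', 'Hamburg', 'Rotterdam', 'other', 'Hamburg', 'Rotterdam', 'other']
```

The counts should be taken from the training set only, and values that first appear in the test set can be mapped to the same "other" category.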

See the screen shot for a sample RapidMiner process (which uses the Replace Rare Values operator from the Operator Toolbox extension).

Edit: Martin is also right, AutoModel will be a good start to check for problematic attributes and find a fitting algorithm.

Sample RapidMiner process for handling nominal values

David