
So I have a training set and the domain of one of the attributes is the following:

A = {Type1, Type2, Type3, ..., Type5}

If the domain remains in that form, I can't apply linear regression, because the mathematical hypothesis can't possibly work, e.g.:

H = T0 * A + T1 * B + T2 * C + ...

(That is, if we assume that all of the attributes are numerical except for A, then you cannot multiply a real-valued parameter by a type.)

Can I substitute the domain with equivalent, discrete numerical values so that I can do linear regression on this problem and be OK?

A = {1, 2, 3, ..., 5}

Is this the best practice? If not, can you please give me an alternative for those situations?


2 Answers


Best practice is to do a one-hot (one-of-K) encoding: for each value that A can take on, define a separate indicator feature. So with five "types", A = type1 would be

[1, 0, 0, 0, 0]

and A = type3 is

[0, 0, 1, 0, 0]

Then concatenate these vectors with your other features so that your hypothesis becomes

H = w[Atype1] * [A=type1] + ... + w[Atype5] * [A=type5] + w[B] * B + ...

using [] to denote indicator functions.

This avoids the main problem with your approach, which is that you're introducing a number of (probably incorrect) biases, e.g. that type5 = type2 + type3. For further intuition why this is better than your encoding, see this answer of mine.

Fred Foo
  • You could perhaps be more clear about whether this eliminates the regression approach or not. Having vectors as outputs forces you to use M (output vector size) linear regressions, doesn't it? – BartoszKP Oct 22 '13 at 09:02
  • @BartoszKP: I finally get why you think this is a classification problem. But regression on categorical variables does occur sometimes (particularly in social science applications), so I'm still assuming the OP really wants to do regression. If not, then indeed, multiple LR is needed, although logistic regression is probably a better idea. – Fred Foo Oct 22 '13 at 09:06
  • Well I never heard of it, so thanks for pointing that out with an example, I'll look into it. Logistic regression seems like a good idea! I'll add this to my answer if you don't mind, you should probably update yours with it also ;) – BartoszKP Oct 22 '13 at 09:10

In general this won't work, because usually an average of nominal attributes doesn't make sense. For example, if you assign Apple = 1, Banana = 2, Orange = 3, then in the model a Banana would appear as the average of an Apple and an Orange. For classification tasks, consider using a perceptron, a neural network (using the winner-take-all paradigm eliminates the problem of averaging nominal attributes), a decision tree, or other such tools. As correctly pointed out by larsmans, a typical model for your case is logistic regression.

Possibly you could also use the WTA paradigm for linear regression, building a separate regression model for each dimension of the output vector.

Clarification: WTA is the same as one-hot in larsmans's answer.

BartoszKP
  • I don't know why you mention classification; parametric classification models such as perceptrons and NNs suffer from the same problem as linear regression. – Fred Foo Oct 22 '13 at 08:58
  • @larsmans It's possible that it's not a classification, but if input space is nominal then usually it is. Perceptron won't suffer the same problem, because its output is binary. For NNs I've mentioned that WTA needs to be used, so it will also deal with this issue. – BartoszKP Oct 22 '13 at 09:01