4

I have a dataset of about 100,000 records on the buying patterns of customers. The data set contains:

  • Age (a continuous value from 2 to 120), though I also plan to categorize it into age ranges.
  • Gender (either 0 or 1)
  • Address (one of only six types, which I can also represent using the numbers 1 to 6)
  • Preferred shop (one of only 7 shops), which is my class label.

So my problem is to classify and predict a customer's preferred shop based on their age, gender, and location. I have tried naive Bayes and decision trees, but their classification accuracy is a bit low.

I am also thinking about logistic regression, but I am not sure how to handle discrete values like gender and address. I have also considered an SVM with some kernel tricks but have not tried it yet.

So which machine learning algorithm do you suggest for better accuracy with these features?

xshaw
  • It is more likely that you need more features. Have you tried out `random forests` yet? – Thomas Jungblut Jan 11 '13 at 09:10
  • You are right, I am short of features, but the data set I have doesn't have many more features to help me out. So I just want to improve the accuracy with the features I have. – xshaw Jan 11 '13 at 09:12
  • 1
    This is impossible to answer without at least some further information. How do the features separate the classes in the feature space? How is the distribution of the classes? What is the distribution of the feature values? Even if you posted the entire data set, we could only do what you can do yourself -- try and see what works. – Lars Kotthoff Jan 11 '13 at 09:38

2 Answers

11

The issue is that you're representing nominal variables on a continuous scale, which imposes a (spurious) ordinal relationship between categories when you use machine learning methods. For example, if you code address as one of six possible integers, then address 1 is closer to address 2 than it is to addresses 3, 4, 5, and 6. This is going to cause problems when you try to learn anything.

Instead, translate your 6-value categorical variable to six binary variables, one for each categorical value. Your original feature will then give rise to six features, where only one will ever be on. Also, keep the age as an integer value since you lose information by making it categorical.
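A minimal sketch of that encoding with pandas (the column names and values here are my own placeholders, not taken from the question):

```python
import pandas as pd

# Placeholder frame standing in for the real 100k-record data set.
df = pd.DataFrame({
    "age":     [23, 45, 67, 31],
    "gender":  [0, 1, 0, 1],
    "address": [3, 1, 6, 3],   # nominal codes 1..6
    "shop":    [2, 5, 2, 7],   # class label: one of 7 shops
})

# One-hot encode the nominal address; leave age and gender as they are.
X = pd.get_dummies(df[["age", "gender", "address"]], columns=["address"])
y = df["shop"]

print(list(X.columns))
# ['age', 'gender', 'address_1', 'address_3', 'address_6']
```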

As for approaches, it's unlikely to make much of a difference (at least initially). Go with whichever is easier for you to implement. However, make sure you run some sort of cross-validated parameter selection on a dev set before running on your test set, as all algorithms have parameters that can dramatically affect learning accuracy.
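For instance, a sketch of cross-validated parameter selection with scikit-learn, using an RBF-kernel SVM as the example classifier (the toy data and the parameter grid are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy stand-in data: replace with the one-hot-encoded features and shop labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = rng.integers(0, 7, size=300)

# Hold out a test set; tune parameters by 5-fold cross-validation on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```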

Ben Allison
  • Thank you, nice explanation @Ben Allison! You said to represent the categorical values as binary values. What about distance-based methods like kNN, does the binary representation affect them? – xshaw Jan 23 '13 at 19:30
  • Regarding the ordinal relationship, another point to add: if memory is an issue, you can "crunch" the variable set if needed. E.g., you can choose only 4 variables, and addr 1 --> (1,0,0,0); addr 2 --> (0,1,0,0), etc. But with addr 5 --> (1,1,0,0) and addr 6 --> (0,1,0,1). If you're allowing for interactions, then those are still "orthogonal" to anything with only 1 value equal to 1. Likely, Ben's suggestion is sufficient, though. :) – Mike Williamson Jul 02 '13 at 00:58
1

You really need to look at the data and determine if there is enough variance between your labels and the features that you currently have. Because there are so few features but a lot of data, something such as kNN could work well.
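A minimal kNN sketch along those lines with scikit-learn (toy data stands in for the real features, and k = 15 is just an example value you would want to tune):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in data: replace with the real (one-hot-encoded) features and shop labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 8)).astype(float)
y = rng.integers(0, 7, size=500)

knn = KNeighborsClassifier(n_neighbors=15)
scores = cross_val_score(knn, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```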

You could also adapt collaborative filtering to your problem, since it likewise works off similarity between records.

Steve