I have a decision tree trained on the columns (Age, Sex, Time, Day, Views, Clicks), which classifies each sample into one of two classes, Yes or No, representing the buying decision for an item X. Using this tree, I'm trying to predict the probability of buying for 1000 samples (customers) that look like ('12','Male','9:30','Monday','10','3'), ('50','Female','10:40','Sunday','50','6'), ... I want an individual probability or score that tells me which customers are most likely to buy item X, so that I can sort the predictions and show the item only to the 5 customers most likely to buy it. How can I achieve this? Will a decision tree serve the purpose? Is there any other method? I'm new to ML, so please forgive any vocabulary errors.
2 Answers
Using a decision tree with a small sample set, you will almost certainly run into overfitting, especially at the lower levels of the tree, where you have exponentially less data to train your decision boundaries. Your data set should have many more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical data, whereas 'age', 'time of day', etc. are real-valued inputs (discrete/continuous). So different parts of your tree will need different formulations. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
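As a rough illustration of the data-type point, here is a minimal sketch that turns one raw tuple from the question into numeric features. The function name `encode`, the `DAYS` list, and the choice of one-hot encoding for the day are assumptions made for this example, not something the question specifies.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def encode(sample):
    """Turn one raw (Age, Sex, Time, Day, Views, Clicks) tuple into numbers."""
    age, sex, time_str, day, views, clicks = sample
    t = datetime.strptime(time_str, "%H:%M")
    minutes = t.hour * 60 + t.minute          # time of day as one continuous value
    is_male = 1.0 if sex == "Male" else 0.0   # binary categorical feature
    # Day has no natural order, so use one indicator per day (one-hot).
    day_one_hot = [1.0 if day == d else 0.0 for d in DAYS]
    return [float(age), is_male, float(minutes), *day_one_hot,
            float(views), float(clicks)]

row = encode(('12', 'Male', '9:30', 'Monday', '10', '3'))
```

With this encoding, 9:30 and 9:31 become the neighbouring values 570 and 571 rather than two unrelated categories, so a tree can split on a threshold such as "before or after noon".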
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Keep a validation set to test each algorithm. Use Matlab (or similar software), where libraries let you quickly try different methods and see which one works best. There is not enough information here to recommend something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks mean the number of views and clicks by each customer for product X.)
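The KNN idea above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper `knn_scores`, the three features (age, minutes since midnight, clicks), the min-max scaling, and the toy training rows are all made up for the example.

```python
import math

def knn_scores(train_X, train_y, query_X, k=3):
    """Score each query row by the fraction of its k nearest
    training rows that are labelled 1 (bought)."""
    # Min-max scale every feature using the training ranges, so that
    # no single feature dominates the distance.
    lo = [min(col) for col in zip(*train_X)]
    hi = [max(col) for col in zip(*train_X)]

    def scale(row):
        return [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(row, lo, hi)]

    scaled_train = [scale(r) for r in train_X]
    scores = []
    for q in query_X:
        sq = scale(q)
        nearest = sorted(range(len(scaled_train)),
                         key=lambda i: math.dist(sq, scaled_train[i]))[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores

# Toy data (invented): [age, minutes since midnight, clicks]
train_X = [[20, 1140, 5], [22, 1200, 6], [19, 1100, 4],
           [50, 600, 1], [55, 540, 0], [45, 660, 2]]
train_y = [1, 1, 1, 0, 0, 0]          # 1 = bought item X
customers = [[21, 1150, 5], [52, 580, 1]]

scores = knn_scores(train_X, train_y, customers, k=3)
# Indices of the (up to) 5 customers with the highest score.
top5 = sorted(range(len(customers)), key=lambda i: scores[i], reverse=True)[:5]
```

A larger `k` gives finer-grained scores (multiples of 1/k), which helps when many customers would otherwise tie.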

-
Thanks a lot. That was helpful. I have converted the continuous data to discrete sets. Also, since we do not have any real data yet, evaluation is getting a little tough. I shall try to implement KNN today; it sounds interesting. Any ideas how I can generate the validation set? – user889789 Apr 06 '14 at 10:24
-
There are many ways to split your complete data set into training and validation sets. For now, try a ratio of 80/20 for splitting data into training/validation. Later, look into hold out validation and k-fold cross validation. Also check [this](http://stackoverflow.com/q/13610074/3155701) which addresses some interesting points about choosing validation sets. Good luck! – user3155701 Apr 06 '14 at 13:15
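A minimal sketch of the 80/20 split and of k-fold cross-validation mentioned above, in plain Python. The function names, the fixed seed, and the stand-in data set of 100 rows are assumptions for illustration only.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(round(len(rows) * validation_fraction))
    return rows[n_val:], rows[:n_val]

def k_fold(rows, k=5):
    """Yield (train, validation) pairs; every row is used
    for validation exactly once across the k folds."""
    rows = list(rows)
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation

data = list(range(100))               # stand-in for 100 labelled samples
train, val = train_validation_split(data)
folds = list(k_fold(data, k=5))
```

Each candidate algorithm is then trained on `train` and compared on `val`; with k-fold, the validation results are averaged over the folds.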
-
I have certain attributes such as (Age, Sex, Time, Day, %Discount ,Category, Buy Class). So the sample looks like (21,Male, 9.30, Thursday, 30%, Sports, Yes) for a buy and (21,Male, 9.30, Thursday, 30%, Sports, No) if the user doesn't buy it. However, we do not have real purchase data. So how should I generate the training data set? – user889789 Apr 12 '14 at 13:52
-
If you can come up with a Bayes Net (BN) that represents your model, you can set the parameters (probabilities) of the BN and sample from it to generate data. In this case the data will be generated from the probability distributions and will reflect correlations among the different variables. There are other ways to generate data from existing data, but you have to have some data to begin with. These methods include adding noise to parts of the data, or applying other kinds of transformations that your model does not care about (or should be invariant to). – user3155701 Apr 12 '14 at 14:54
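To make the sampling idea concrete, here is a small sketch with a three-node network, AgeGroup -> Buy <- Discount. The structure and every probability in it are invented for illustration; in practice they would come from domain knowledge about your customers.

```python
import random

# Hypothetical network: AgeGroup -> Buy <- Discount.
# All probabilities below are made-up placeholders.
P_AGE = {"young": 0.5, "old": 0.5}
P_DISCOUNT = {"10%": 0.7, "30%": 0.3}
P_BUY = {("young", "10%"): 0.20, ("young", "30%"): 0.60,
         ("old", "10%"): 0.05, ("old", "30%"): 0.30}

def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values])[0]

def sample_rows(n, seed=0):
    """Ancestral sampling: draw the parents first, then the child
    conditioned on the sampled parent values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = draw(P_AGE, rng)
        discount = draw(P_DISCOUNT, rng)
        buy = "Yes" if rng.random() < P_BUY[(age, discount)] else "No"
        rows.append((age, discount, buy))
    return rows

def buy_rate(rows, age, discount):
    """Observed fraction of 'Yes' among rows with the given parents."""
    matching = [r for r in rows if r[0] == age and r[1] == discount]
    return sum(r[2] == "Yes" for r in matching) / len(matching)

rows = sample_rows(1000)
```

Because the label is drawn conditionally on the parents, the generated rows carry the dependence you encoded (here, younger customers with a larger discount buy more often), which independent random columns would not.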
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
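The "score everyone, keep the top 5" step might look like the sketch below. It assumes you already have a probability of buying X per customer from whatever classifier you trained; the customer IDs and scores here are invented.

```python
import heapq

def top_n(customer_scores, n=5):
    """Return the n (customer, score) pairs with the highest score."""
    return heapq.nlargest(n, customer_scores.items(), key=lambda kv: kv[1])

# Hypothetical probabilities of buying item X, one per customer.
scores = {"c1": 0.91, "c2": 0.15, "c3": 0.78, "c4": 0.66,
          "c5": 0.42, "c6": 0.88, "c7": 0.05, "c8": 0.59}

best = top_n(scores, n=5)
best_ids = [customer for customer, _ in best]
```

`heapq.nlargest` avoids sorting the full list, which only matters once the number of customers is much larger than the number you want to keep.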

-
Thanks a lot. Helps answer my question. However, things do get a little messy when there are 20 people with a probability of 1.0. – user889789 Apr 05 '14 at 14:12
-
Thanks a lot, Sean. That helps answer my question. However, things do get a little messy when there are more than 5 people with a probability of 1.0 (which happens because of the small training sample size). Is there a better way to do this? And if I wanted to predict the likelihood of buying multiple items, how would I do it? – user889789 Apr 05 '14 at 14:19
-
That indicates serious overfitting. Use more trees, more data, or build shallower trees. – Sean Owen Apr 05 '14 at 15:05
-
[main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file /home/rachana/dataset.csv Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.io.Files.getFileExtension(Ljava/lang/String;)Ljava/lang/String; at org.apache.mahout.common.iterator.FileLineIterator.getFileInputStream(FileLineIterator.java:118) at org.apache.mahout.common.iterator.FileLineIterator.<init>(FileLineIterator.java:79). Hi Sean, I'm getting this error while trying to run the Mahout recommender tutorial. – user889789 Apr 07 '14 at 15:00
-
I have certain attributes such as (Age, Sex, Time, Day, %Discount ,Category, Buy Class). So the sample looks like (21,Male, 9.30, Thursday, 30%, Sports, Yes) for a buy and (21,Male, 9.30, Thursday, 30%, Sports, No) if the user doesn't buy it. However, we do not have real purchase data. So how should I generate the training data set? And how do I calculate the false positive and false negatives in this as I do not know if my test sample has purchased a particular product or not. – user889789 Apr 12 '14 at 13:54