I am having a problem with my end-of-studies project. I have labeled data with many variables (Y is a continuous variable in my case), but at prediction time only a small number of them (say 5 or 6) will be available. When I train models on just those variables, the results are not good enough because they contain too few predictors. I am not sure how to proceed. Should I use clustering techniques, or semi-supervised learning? I am not very familiar with the latter, but I don't think it really fits my use case. My dataset is confidential, so I am using the Boston dataset as an example:

import pandas as pd

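# Column names for the Boston housing data; MEDV is the target (Y).
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# The file is whitespace-delimited and has no header row.
data = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", header=None, delimiter=r"\s+", names=column_names)
data.head()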

Suppose this is my dataset. In the prediction phase (when the company wants to use the model), only these four variables are available as input: ['CRIM', 'ZN', 'INDUS', 'CHAS']. Training the model on only those 4 variables will likely give poor results because there are too few predictors to learn from. So I am asking whether there is a way to take advantage of the other variables I have in my training set, or another technique that involves clustering or grouping observations, on the idea that houses with similar values of the variables should have similar prices.
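
To make the gap concrete, here is a minimal sketch that compares a model trained on all columns with one trained only on the columns available at prediction time. It uses synthetic NumPy data as a stand-in for the confidential dataset; the sizes, the `n_deploy` split, and the helper names `fit_linear`, `predict_linear`, and `r2_score` are assumptions made for illustration, not part of the real project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 13 predictors, of which the first 4
# play the role of the columns available at prediction time.
n_samples, n_features, n_deploy = 500, 13, 4
X = rng.normal(size=(n_samples, n_features))
true_coef = rng.normal(size=n_features)
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Simple hold-out split: first 400 rows for training, the rest for testing.
train, test = slice(0, 400), slice(400, None)


def fit_linear(X_train, y_train):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef


def predict_linear(coef, X_new):
    """Apply the fitted coefficients (intercept first) to new rows."""
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Model that sees every column (not usable in production here).
coef_all = fit_linear(X[train], y[train])
r2_all = r2_score(y[test], predict_linear(coef_all, X[test]))

# Model restricted to the columns available at prediction time.
coef_deploy = fit_linear(X[train, :n_deploy], y[train])
r2_deploy = r2_score(y[test], predict_linear(coef_deploy, X[test, :n_deploy]))

print(f"R^2 with all columns:        {r2_all:.3f}")
print(f"R^2 with deployable columns: {r2_deploy:.3f}")
```

In this synthetic setup the columns are independent, so the restricted model cannot recover what the missing columns explain. On real data the available columns may be correlated with the missing ones, so the gap could be smaller; running the same comparison on the real dataset would show how much is actually lost.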

I have just started the project and haven't tried much on the data yet, but if you have any useful resources on this problem I would be very grateful.

Abdelkabir
  • When you say variables do you mean categories? Are you trying to discretize a continuous `y` into different categories? So for example if y can be any number between 0 and 100, you want that range [0, 20] categorized as A, (20, 60] as B and (60, 100] as C? Something like this? – Sembei Norimaki Mar 13 '23 at 09:13
  • Let me rephrase to see if I understand correctly: Not all of the variables which are available for training are available for the prediction? – DataJanitor Mar 13 '23 at 09:17
  • Clustering is unsupervised, so if you don't have information about the final category of each point that's the way to go. If you know for each datapoint which category should be assigned, then you should use a classification algorithm. There are many and depending on your application some will work better than others. If you can provide more information and an example of your data we may be able to further help you. – Sembei Norimaki Mar 13 '23 at 09:23
  • @SembeiNorimaki No, Y is a continuous variable. It is a regression problem. The problem is that I have many variables that are very important for the model to learn from (20 or more), but I am allowed to use only 5 or 6 of them in the prediction phase, so if I use only those variables in my regression problem I will get very poor results. What I have in mind is to do some kind of clustering on all the training data, validate that analysis against the Y variable, and then use only the few available variables in the prediction phase together with the results of the analysis. – Abdelkabir Mar 13 '23 at 09:33
  • @Jan yes exactly! – Abdelkabir Mar 13 '23 at 09:33
  • Ok, when you say variables you mean features. You have to use the same features in training and in testing. If you have, say, 20 but only want to use 4, you can use dimensionality reduction techniques (take a look at a simple one like PCA) to reduce the number of features and use the top 4. But you need to use exactly the same ones, in the same order, for training and testing. Don't train with more features than you will use in the test. This will only give you falsely high results in training, which will drop in the test. – Sembei Norimaki Mar 13 '23 at 10:16
  • Please [don’t post images of code, error messages, or other textual data.](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors) – tripleee Mar 13 '23 at 10:19
  • @SembeiNorimaki But I can't do PCA, because the variables I must use in the prediction phase are fixed and I don't get to choose them. – Abdelkabir Mar 13 '23 at 10:28
  • @SembeiNorimaki But I am looking for a way to form groups of data and assign to each group a range of my Y variable. – Abdelkabir Mar 13 '23 at 10:28
  • If the features for prediction are fixed and are the only ones available, then you should train your model with only these features. You cannot train with features that won't be available in the test set. If you do, the model will rely on information that won't be available. – Sembei Norimaki Mar 13 '23 at 10:35
  • Ok, I will search for other techniques that may be useful. Thanks for the replies. – Abdelkabir Mar 13 '23 at 10:48
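
The grouping idea raised in the comments above (cluster the training data on all variables, attach a range of Y to each group, then place new rows using only the available variables) could be sketched roughly as follows. This is one possible reading of that idea, not an established recipe: the synthetic data, the small `kmeans` helper, and the nearest-centroid assignment in `predict_from_deployable` are all assumptions made for illustration. Real data would usually need feature scaling first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 3 latent groups, 6 predictors,
# of which only the first 2 are available at prediction time.
n_per_group, n_features, n_deploy, k = 100, 6, 2, 3
centers = rng.normal(scale=4.0, size=(k, n_features))
X = np.vstack([c + rng.normal(size=(n_per_group, n_features)) for c in centers])
group_level = np.array([10.0, 20.0, 30.0])
y = np.repeat(group_level, n_per_group) + rng.normal(size=k * n_per_group)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    local_rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centroid as-is
                centroids[j] = X[labels == j].mean(axis=0)
    # Final assignment so labels match the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


# Step 1: group the training rows using ALL variables.
centroids, labels = kmeans(X, k)

# Step 2: summarise Y inside each group (mean and observed range).
group_mean = np.array([y[labels == j].mean() for j in range(k)])
group_range = [(y[labels == j].min(), y[labels == j].max()) for j in range(k)]


def predict_from_deployable(X_new_deploy):
    """Assign rows to the nearest centroid using only the deployable
    coordinates, then return that group's mean Y."""
    d = np.linalg.norm(
        X_new_deploy[:, None, :] - centroids[None, :, :n_deploy], axis=2
    )
    return group_mean[d.argmin(axis=1)]


preds = predict_from_deployable(X[:, :n_deploy])
for j, (lo, hi) in enumerate(group_range):
    print(f"group {j}: mean Y = {group_mean[j]:.2f}, range = [{lo:.2f}, {hi:.2f}]")
```

Note that the assignment step still sees only the deployable columns, which is consistent with the point made in the comments: the prediction cannot use information that is unavailable at prediction time. Whether this beats a plain regression on the same few columns would have to be checked on held-out data.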

0 Answers