I am having a problem with my end-of-studies project. I have labeled data with many variables (Y is a continuous variable in my case), but at prediction time I will only have access to a small subset of those variables (say 5 or 6). When I train models on just that subset, the results are not good enough because there are too few predictors for the data I will actually be predicting on. I am not sure how to proceed. Should I use clustering techniques? Or semi-supervised learning? I am not very familiar with the latter, but I don't think it is really my use case. My dataset is confidential, so I am using the Boston housing dataset as an example:
```python
import pandas as pd

# The Boston housing file has no header row and is whitespace-delimited
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", header=None, sep=r"\s+", names=column_names)
data.head()
```
So suppose I have this dataset, but in the prediction phase (when the company wants to use the model) it can only use these four variables as input: ['CRIM', 'ZN', 'INDUS', 'CHAS'].
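For concreteness, this is the restricted baseline I mean (a minimal sketch with scikit-learn; the model choice and the train/test split are just placeholders):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Only the four columns available at prediction time
available = ['CRIM', 'ZN', 'INDUS', 'CHAS']
X = data[available]
y = data['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline trained on the restricted feature set only
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))
```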
Training the model on only those 4 variables will surely lead to bad results because there are just not enough predictors for the model to learn from. So I am asking whether there is a way to take advantage of the other variables I have in my training set, or another technique that involves clustering or grouping of observations, as in this example: houses that have roughly the same values for these variables should get roughly the same predicted price (see the sketch below).
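Something like this, where I cluster on the four available variables and predict the mean price of each cluster (a rough sketch of the idea; KMeans, the scaler, and the number of clusters are just assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

available = ['CRIM', 'ZN', 'INDUS', 'CHAS']

# Scale the four available features, then group similar houses together
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[available])
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_scaled)

# Each cluster predicts the mean MEDV of its training members
cluster_means = data.groupby(kmeans.labels_)['MEDV'].mean()

# At prediction time: assign a new house to a cluster, return that cluster's mean
new_house = scaler.transform(data[available].iloc[[0]])
print(cluster_means[kmeans.predict(new_house)[0]])
```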
I just started the project and haven't really tried much on the data yet, but if you have any useful resources on this problem I would be very grateful for the help.