I have an Excel file with 250 features, feature0 to feature249, all floating-point numbers, and 7000 data points. There is also a label column with the corresponding class value; there are 5 unique classes (0-4). No data dictionary is available. I have to train the model(s) on train.csv and predict the most probable class label for the data in test.csv, using Python.
Question 1: Which algorithm can I start with in Python, since I am new to it? Is there a template or GitHub link with code I can reuse? I observed that the distribution of the data across all classes is homogeneous.
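To make Question 1 concrete, here is a minimal sketch of what a first baseline might look like. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for train.csv (same number of features and classes, fewer rows). The choice of `LogisticRegression` is only an assumption for illustration, not a recommendation taken from any source.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train.csv (assumption): 250 float features, 5 classes.
X, y = make_classification(
    n_samples=2000, n_features=250, n_informative=20,
    n_classes=5, random_state=0,
)

# Scale the features, then fit a multiclass logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy gives a first number to improve on.
scores = cross_val_score(baseline, X, y, cv=5)
print(scores.mean())
```

With 5 balanced classes, random guessing gives about 0.2 accuracy, so any candidate model should be compared against that figure.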
Question 2: Which package can I use to select the important variables out of the 250, since I will be training on my local machine?
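As an illustration of what Question 2 is asking for, the sketch below uses `SelectKBest` with the ANOVA F-score (`f_classif`) from scikit-learn. The data is again synthetic, and `k=50` is an arbitrary value chosen for the example, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the 250-feature training data (assumption).
X, y = make_classification(
    n_samples=2000, n_features=250, n_informative=20,
    n_classes=5, random_state=0,
)

# Keep the 50 features with the highest ANOVA F-score against the label.
selector = SelectKBest(score_func=f_classif, k=50)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(X_selected.shape)
print(kept[:10])
```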
Question 3: How can I check the distribution of each variable so that I can remove outliers and null values from the data? Is there a Python package that does this automatically?
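For Question 3, this is roughly the kind of check I have in mind, sketched with pandas on a small made-up DataFrame. The column names, the injected missing value, the injected outlier, and the 1.5 * IQR rule are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real features (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(1000, 5)),
    columns=[f"feature{i}" for i in range(5)],
)
df.iloc[0, 0] = np.nan   # inject one missing value
df.iloc[1, 1] = 50.0     # inject one obvious outlier

# Per-column count, mean, std, min, quartiles and max.
summary = df.describe().T

# Number of null values in each column.
null_counts = df.isna().sum()

# Flag values outside 1.5 * IQR of the quartiles (one common rule of thumb).
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(summary)
print(null_counts)
print(outlier_counts)
```

If matplotlib is installed, `df.hist()` draws one histogram per column, which is a quick visual check of each distribution.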
My Findings:
I am trying to start with this link: http://scikitlearn.org/stable/modules/neural_networks_supervised.html#classification
In this line
scaler.fit(X_train)
What is the type of X_train? Is it a NumPy array? Since my values are in an Excel file, do I need to convert them to NumPy format?
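To show what I mean, here is a minimal sketch that reads CSV data with pandas and passes it to `StandardScaler`. The tiny in-memory CSV and its column names are made up for the example. As far as I understand, scikit-learn estimators accept array-like input, so both a pandas DataFrame and a NumPy array should work here.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny in-memory stand-in for train.csv (assumption); with a real file,
# pd.read_csv("train.csv") would be used instead.
csv_text = "feature0,feature1,label\n0.5,1.5,0\n1.5,2.5,1\n2.5,3.5,2\n"
train = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the label column.
X_train = train.drop(columns=["label"])   # pandas DataFrame
y_train = train["label"].to_numpy()       # NumPy array of class labels

# fit() accepts the DataFrame directly; transform() returns a NumPy array
# by default.
scaler = StandardScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X_train)

# Explicit conversion to a NumPy array, if one is wanted.
X_as_array = X_train.to_numpy()
print(type(X_scaled), X_as_array.shape)
```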
Note: I am new to multiclass classification problems, so I do not yet have a working solution to post. Any help would be appreciated, rather than a "-1".