
I have an Excel file with columns feature0 to feature249, all floating-point numbers (250 features and 7000 data points in total), plus a label column with the corresponding class value. There are 5 unique classes (0-4). No data dictionary is available. I have to train the model(s) on train.csv and compute the most probable class label for the data in test.csv, using Python.

Question 1: Which algorithm can I start with in Python, since I am new to it? Is there a template or GitHub link whose code I can reuse? I observed that the distribution of data across all classes is homogeneous.

Question 2: Which package can I use to select the important variables out of the 250, since I will be training on my local machine?

Question 3: How can I check the distribution of each variable so that I can remove outliers and null values from the data? Is there a Python package that does this automatically?

My Findings:

I am trying to start with this link: http://scikitlearn.org/stable/modules/neural_networks_supervised.html#classification

In this line

  scaler.fit(X_train)

What is the type of X_train? Is it a NumPy array? Since my values are in an Excel file, do I have to convert them to NumPy format?

Note: Since I am new to multiclass classification problems, I do not have a solution to post. Any help would be appreciated rather than a "-1".

DreamerP

1 Answer


You can choose an algorithm to suit your needs from this link: python/scikit-learn

There are detailed descriptions of algorithms as well as usage examples on the website.
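As one possible starting point for Questions 1 and 2, here is a minimal sketch that trains a random forest and ranks features by importance. The random forest is only an illustrative choice, not a recommendation from the scikit-learn pages, and the data is a synthetic stand-in generated with make_classification, so the sizes and variable names below are assumptions rather than the real train.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 5 classes, float features.
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_classes=5, random_state=0,
)

# Hold out 20% of the rows to estimate how well the model generalizes.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
valid_accuracy = clf.score(X_valid, y_valid)

# Most probable class label for each held-out row.
predicted = clf.predict(X_valid)

# Indices of the 10 features the forest considers most important.
top_features = np.argsort(clf.feature_importances_)[::-1][:10]

print("validation accuracy:", valid_accuracy)
print("top feature indices:", top_features)
```

With the real data, X and y would come from train.csv, and clf.predict would be called on the features from test.csv.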

For your other needs, you can use Pandas and Numpy modules:

pandas/fillna

stackoverflow/detect-and-exclude-outliers-in-pandas-dataframe
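A rough sketch of how those two links fit together, assuming a small made-up DataFrame in place of the real features: describe() summarizes each column's distribution, fillna replaces nulls (here with the column median, which is one choice among several), and a z-score filter drops rows with extreme values. The threshold of 3 standard deviations is a common convention, not something the data prescribes.

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): 200 rows, 3 float features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["feature0", "feature1", "feature2"],
)
df.loc[5, "feature0"] = np.nan   # inject a missing value
df.loc[10, "feature1"] = 50.0    # inject an obvious outlier

# Per-column count, mean, std, min, quartiles, max.
print(df.describe())

# Replace each missing value with the median of its column.
filled = df.fillna(df.median())

# Keep only rows where every feature lies within 3 standard deviations
# of its column mean.
z_scores = (filled - filled.mean()) / filled.std()
keep = (z_scores.abs() < 3).all(axis=1)
cleaned = filled[keep]

print(len(filled), "rows before,", len(cleaned), "rows after")
```

If the label column is held separately, it has to be filtered with the same keep mask so that features and labels stay aligned.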

Use pandas.read_csv to read the data from a CSV file.
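On the question of what X_train should be: scikit-learn estimators accept a pandas DataFrame as well as a NumPy array, so no manual conversion is needed. The sketch below reads a tiny inline CSV (a made-up stand-in for train.csv, with only two features) and passes the resulting DataFrame straight to scaler.fit.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny inline CSV standing in for train.csv (assumption).
csv_text = (
    "feature0,feature1,label\n"
    "0.5,1.2,0\n"
    "1.5,0.2,1\n"
    "2.5,3.2,2\n"
    "3.5,2.2,1\n"
)

# With the real file this would be: train = pd.read_csv("train.csv")
train = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the class label.
X_train = train.drop(columns="label")
y_train = train["label"]

# A DataFrame can be passed directly; by default the result is a NumPy array.
scaler = StandardScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X_train)

print(type(X_train).__name__, "->", type(X_scaled).__name__)
print(X_scaled)
```

If the data really is an Excel workbook rather than a CSV file, pandas.read_excel is the analogous call. The same fitted scaler should then be applied to the test features with scaler.transform, not refitted.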

M. Yasin
  • This does not give the appropriate result; please see my output in https://stackoverflow.com/questions/49514425/removing-outliers-from-pandas-data-frame-using-percentile – DreamerP Mar 27 '18 at 13:55