
[Dataframe]

Hi,

Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column, so that I can use the complete data for building data science models?

Thanks,

iahmed
  • What did you try? You should read [how to ask a good question](https://stackoverflow.com/help/how-to-ask) – Lescurel Sep 12 '17 at 10:08
  • Welcome to SO. Firstly, don't post a link to an image. Secondly, did you see this: https://stackoverflow.com/questions/9365982/missing-values-in-scikits-machine-learning. You need to decide what to do with missing data; the options are to drop the rows, impute them, or set them to some dummy value that doesn't skew the data – EdChum Sep 12 '17 at 10:09
  • Also, if you post your data as text rather than an image, people can copy/paste it to test whether their answers work; otherwise you'll get guesses. `print(df.to_string())` is more useful, because responders can then `pd.read_clipboard()` to replicate your dataframe. – Stael Sep 12 '17 at 10:10
  • I tried this code; can you please let me know if it is correct: `from sklearn.preprocessing import Imputer; imputer = Imputer(missing_values='NaN', strategy='median', axis=0); d1 = pd.read_csv('Train.csv'); X = d1.iloc[:8].values` – iahmed Sep 12 '17 at 10:12
  • @EdChum @Stael @Lescurel Thanks for your advice; I will keep these in mind for future posts. I have also now added my imputation code. I basically don't want to drop these rows – iahmed Sep 12 '17 at 10:15
  • Add these things to the question. Also, what did that code give you, and why is that not what you wanted? – Stael Sep 12 '17 at 10:17

3 Answers


This is one of the major challenges of data mining (or machine learning) problems. YOU decide what to do with the missing data, based on PURE EXPERIENCE. You mustn't look at data science as a black box that follows a series of steps to be successful!

Some guidelines about missing data:

A. If more than 40% of the data is missing from a column, drop it! (Again, the 40% threshold depends on what type of problem you're working with, and on whether the data is super crucial or trivial enough to ignore.) A minimal sketch of this rule is shown below.
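
A minimal sketch of guideline A, assuming a pandas DataFrame loaded from the `Train.csv` file mentioned in the comments; the 40% cutoff is just the rule of thumb suggested above:

```python
import pandas as pd

df = pd.read_csv('Train.csv')  # file name taken from the question's comments

# dropna(axis=1, thresh=...) keeps only the columns that have at least
# `thresh` non-null values, i.e. it drops columns more than 40% missing.
df = df.dropna(axis=1, thresh=int(0.6 * len(df)))
```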

B. Check if there is some way you can impute the missing data from the internet. You're looking at item weight! If there were any way to know which product you're dealing with, instead of just the hash-coded Item_Identifier, you could always literally Google it and figure it out.

C. Missing data can be classified into two types:

MCAR: missing completely at random. This is the desirable scenario in case of missing data.

MNAR: missing not at random. This is a more serious issue, and in this case it might be wise to examine the data-gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?

Assuming the data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If more than 5% of the data is missing for a certain feature or sample, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function like the sketch below.
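
One possible version of that "simple function", as a hedged sketch: `missing_fraction` is an illustrative name, not a library call, and `Train.csv` is the file from the question's comments.

```python
import pandas as pd

def missing_fraction(df, axis=0):
    """Fraction of missing values per column (axis=0) or per row (axis=1)."""
    return df.isnull().mean(axis=axis)

df = pd.read_csv('Train.csv')

col_missing = missing_fraction(df)          # per feature (column)
row_missing = missing_fraction(df, axis=1)  # per sample (row)

# Features and samples breaching the 5% threshold discussed above
print(col_missing[col_missing > 0.05])
print(row_missing[row_missing > 0.05])
```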

D. As posted in the comments, you can simply drop the rows using `df.dropna()`, fill them with infinity or some other dummy value, or fill them with the group mean using `df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))`. This groups the column `value` from dataframe `df` by the category `name`, finds the mean within each category, and fills each missing value in `value` with the corresponding mean of that category (see the sketch below).
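
A runnable sketch of guideline D; the `name`/`value` columns are the placeholder names used above, not columns from the question's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['a', 'a', 'b', 'b'],
    'value': [1.0, None, 3.0, None],
})

# Option 1: drop the rows that contain any missing value
dropped = df.dropna()

# Option 2: fill each missing value with the mean of its category
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
print(df)  # the missing 'a' value becomes 1.0, the missing 'b' value becomes 3.0
```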

E. Apart from dropping missing values or replacing them with the mean or median, there are more advanced techniques that predict missing values and fill them in, e.g. MICE (Multivariate Imputation by Chained Equations). You should browse and read more about where advanced imputation techniques are helpful; one sketch follows.
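
A minimal sketch of MICE-style imputation using scikit-learn's IterativeImputer (an explicitly experimental, MICE-inspired estimator). Note it only handles numeric columns, so this illustrates the technique rather than being a drop-in fix for the categorical Outlet_Size:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the API)
from sklearn.impute import IterativeImputer

df = pd.read_csv('Train.csv')                 # file name taken from the question's comments
numeric = df.select_dtypes(include='number')  # IterativeImputer works on numeric data

# Each feature with missing values is modelled as a function of the others,
# and the predictions are used to fill the gaps, chained over several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric.columns] = imputer.fit_transform(numeric)
```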


The "Outlet_Size" column contains categorical data, so instead of dropping the rows, use a measure of central tendency to fill in the missing values.

Since the data is categorical, the appropriate measure of central tendency is the mode. Use the mode to find which category occurs most frequently, and fill the missing entries in the column with that value.

Code:

Dataframe['Outlet_Size'].mode()  # inspect the most frequent category
Dataframe['Outlet_Size'].fillna(Dataframe['Outlet_Size'].mode()[0], inplace=True)  # .mode() returns a Series, so take its first element
Ailurophile

The accepted answer is really nice.

In your specific case, I'd say either drop the column or assign a new value called Missing. Since Outlet_Size is a categorical variable, there's a good chance it ends up going into a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from multiple causes (from bad data to technical difficulties in getting an answer, etc.). Be careful, and watch that this doesn't bring in bias or information you shouldn't have (for example: the products have NaN because they are not in a certain database, something that will never happen in a real situation, which would make your results non-representative of the true situation). A sketch of the Missing-category approach follows.
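
A minimal sketch of the "Missing" category idea, assuming the `Train.csv` file from the question; one-hot encoding via `pd.get_dummies` is just one illustrative encoder choice:

```python
import pandas as pd

df = pd.read_csv('Train.csv')  # file name taken from the question's comments

# Treat the absence of a value as its own category instead of guessing one
df['Outlet_Size'] = df['Outlet_Size'].fillna('Missing')

# One-hot encode so the model can use the 'Missing' level directly
encoded = pd.get_dummies(df, columns=['Outlet_Size'])
print(encoded.filter(like='Outlet_Size').head())
```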

Adept