
I am following this link to remove outliers, but something is logically wrong here:

Remove Outliers in Pandas DataFrame using Percentiles

I have a dataset with first column as "id" and last column as "label".

Here is my code: I remove the label and id columns, filter by percentiles, and then append the columns back:

import pandas as pd

def processing_data(train_data, test_data):
    # compute the low/high percentiles over the feature columns only
    low = .05
    high = .95
    filt_df = train_data.loc[:, train_data.columns != 'id']
    filt_df = filt_df.loc[:, filt_df.columns != 'label']
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

    # filter values based on the computed percentiles; apply column by column
    print("Before removing outliers", filt_df, filt_df.shape)
    train_data1 = filt_df.apply(lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])], axis=0)
    print("After removing outliers", train_data1, train_data1.shape)
    print(train_data1.isnull().sum())

    # append the id and label columns back
    train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
    train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)
    #train_data.dropna(inplace=True)

    #train_data.fillna(0)
    #test_data.fillna(0)
    #print(train_data)
    #print(np.isnan(train_data).any().sum())
    return train_data, test_data

Output: all the rows contain some NaN values, and when I do `train_data.dropna(inplace=True)` all the rows are dropped. Strange!!

How can I fix this? I suspect something is fishy where I concat the id and label columns back after the outlier treatment.

Here is the dataset:

id  feature0      feature1      feature2      feature3      feature4      ...  feature249  label
0   25.20824887   -16.7457484   50.86994402   5.593471686   1.188262678   ...              1
1   -86.93144987  0.428227194   2.87483597    -8.064850183  6.056867093   ...              2
2   42.16093367   7.85701304    151.6127571   9.639675583   5.570138511   ...              0
3   20.66694385   8.680641918   -56.44917913  -9.814779803  -2.382979151  ...              1
4   35.9466789    4.57373573    -28.16021186  -6.91297056   4.879375409   ...              0
  • Are you trying to `dropna()` or `fillna()` (you have both)? `dropna()` has the default parameter `'any'` and `axis = 0`, which means if any row contains `NaN`, it gets dropped. You could try changing `'any'` to `'all'`, but I doubt that's what you're looking for. – Ian Thompson Mar 27 '18 at 14:26
  • That was just to explore the output. I just want to remove outliers from the dataset. – DreamerP Mar 27 '18 at 14:32
    Can you post some sample starting data and what you want it to look like? – Ian Thompson Mar 27 '18 at 14:33
    An [mcve], including some data that runs with your function, would be helpful here. – jpp Mar 27 '18 at 15:01
  • Any link where I can upload the complete dataset? This will help a lot. – DreamerP Mar 27 '18 at 16:14
    @DreamerP You do not need to upload a complete dataset. If you have excel, just copy first 15-20 rows of data and paste it here and format it as code. That should be enough. – Floydian Mar 27 '18 at 16:25
  • Added the data sample – DreamerP Mar 27 '18 at 17:50
  • @DreamerP, so what's `train_data` and `test_data`? [In numeric example, not words please.] Also it seems like `feature249` is not in your data example, so don't include the heading. Please can you be more precise about inputs? – jpp Mar 27 '18 at 17:58
  • You can ignore the test data, Run it on the train data that I have pasted. To make you clear there are 250 features. Since I cannot paste all the features here, I have pasted just 4 features. That is the reason I have put ------ feature249 :) I can attach original dataset if you have some link. – DreamerP Mar 27 '18 at 17:59
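The `dropna` default behavior described in the comments can be sketched in a few lines (a minimal toy frame, not the OP's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

# default how='any': drop a row if ANY value in it is NaN
print(df.dropna().shape)           # (1, 2) -- only the last row survives
# how='all': drop a row only if EVERY value in it is NaN
print(df.dropna(how='all').shape)  # (2, 2) -- only the all-NaN row is dropped
```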

1 Answer


When I ran your code with your example, I got a ValueError. I found this issue, which mentions that with float DataFrame elements `quantile` has erratic behavior, either returning NaNs or raising a ValueError: https://github.com/pandas-dev/pandas/issues/14564 . I think in this case it is the feature249 column, which is int while the rest are floats. When I used `filt_df = pd.DataFrame(filt_df, dtype=float)` to force all the columns to floats, it ran fine.
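A minimal sketch of that dtype fix, using a toy frame with one int column among floats (`astype(float)` is equivalent to the `pd.DataFrame(..., dtype=float)` call above):

```python
import pandas as pd

# toy frame mimicking the issue: one int column among floats
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10, 20, 30, 40]})   # int column

df = df.astype(float)          # force a uniform float frame
q = df.quantile([0.05, 0.95])  # now computes cleanly for every column
print(q)
```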

The NaNs in each row are what gets put in place when you filter by low and high. Each row in the example has at least one value outside your .05/.95 boundaries (your data may be much more spread out than you think). This means that when you `dropna` with the default `'any'`, all rows are removed. You can change how `dropna` operates by switching `'any'` to `'all'` or using another option, but it is probably better to adjust your upper/lower bounds to be more in line with your data's spread. Remember: even though your bounds are pretty exclusive per column, with each added column it becomes more and more likely that at least one value in each row falls outside those bounds.
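If you want to drop outlier rows without ever introducing NaNs (so id and label stay aligned by construction), one alternative is to build a single boolean row mask from the feature columns and index with it. A sketch on synthetic data (the column names `f0`..`f2`, `id`, `label` are made up for illustration):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the training data
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f0', 'f1', 'f2'])
train.insert(0, 'id', range(100))
train['label'] = rng.integers(0, 2, size=100)

low, high = 0.05, 0.95
feats = train.drop(columns=['id', 'label'])
q = feats.quantile([low, high])

# keep a row only if EVERY feature lies inside its own [low, high] band
mask = ((feats >= q.loc[low]) & (feats <= q.loc[high])).all(axis=1)
clean = train[mask]  # id and label stay aligned; no NaNs introduced
print(clean.shape)
```

Note this discards a whole row when any one feature is extreme, which (as discussed above) removes more rows the more feature columns you have; widening the bounds or clipping instead of dropping are the usual ways to soften that.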
