I have a numpy array looking like this:
[[100, 1, 0.01, '5'], [50, 2, 0.02, '3'], [4000, 1, 0.01, '3']]
And I'm trying to do 2 things: normalize the data in the first 3 columns, and remove rows that have outliers in the first 3 columns (keeping the 4th column intact, as a string).
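For reference, the array is built more or less like this (just a sketch; because the types are mixed, numpy stores every value as a string, which is why I cast back to float further down):

import numpy as np

# mixed ints/floats/strings all end up stored as strings in one array
a = np.array([[100, 1, 0.01, '5'],
              [50, 2, 0.02, '3'],
              [4000, 1, 0.01, '3']])
print(a.dtype)  # something like dtype('<U32'), i.e. everything is a string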
I already have a function to normalize data, which I took from here: Normalize numpy array columns in python
And I already have a function to remove a complete row when one of its values is an outlier, which I took from here: Removing outliers in each column (and corresponding row)
But that function normalizes all the columns, and I don't want it to affect the last one. So I tried temporarily removing the last column and putting it back afterwards, like this:
temp_col = np.take(a, [3], axis=1)   # set the string column aside
a = np.delete(a, [3], axis=1)        # drop it from the main array
a = a.astype(float)                  # convert the remaining columns to numbers
a = remove_outliers(a, 6)
a = normalize_data(a)
a = np.append(a, temp_col, axis=1)   # won't work
And these are the methods used (taken from the sources linked above):
def normalize_data(a):
    # divide each column by its own maximum
    return a / a.max(axis=0)
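For example, on the first 3 columns of my data (cast to float), this scales each column so its maximum becomes 1; worked out by hand it gives roughly:

b = np.array([[100, 1, 0.01],
              [50, 2, 0.02],
              [4000, 1, 0.01]], dtype=float)
normalize_data(b)
# column maxima are 4000, 2 and 0.02, so the result is approximately:
# [[0.025 , 0.5, 0.5],
#  [0.0125, 1. , 1. ],
#  [1.    , 0.5, 0.5]]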
def remove_outliers(a, m):
    # keep only rows where every column is within m standard deviations of its mean
    mask = np.ones((a.shape[0],), dtype=bool)
    mu, sigma = np.mean(a, axis=0), np.std(a, axis=0, ddof=1)
    for j in range(a.shape[1]):
        col = a[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < m
    return a[mask]
But now the problem is that when I remove the outlier rows, the length of my temporary column no longer matches the number of rows in the array, so I can't append it back.
Does anyone have a solution for this problem? Should I do it the hard way and save the indices of the rows that were removed as outliers, and then remove them from my temp_col as well?
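In case it helps to see what I mean, this is roughly the idea (just a sketch, using a hypothetical outlier_mask helper that returns the boolean row mask instead of the filtered array, so the same mask can be applied to temp_col):

def outlier_mask(a, m):
    # same logic as remove_outliers, but returns the row mask instead of a[mask]
    mask = np.ones((a.shape[0],), dtype=bool)
    mu, sigma = np.mean(a, axis=0), np.std(a, axis=0, ddof=1)
    for j in range(a.shape[1]):
        col = a[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < m
    return mask

mask = outlier_mask(a, 6)
a = normalize_data(a[mask])          # filter and normalize the numeric columns
temp_col = temp_col[mask]            # drop the same rows from the string column
a = np.append(a, temp_col, axis=1)   # row counts match again (result becomes a string array due to mixed dtypes)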
Thank you very much!