I have a numpy array looking like this:
[[100, 1, 0.01, '5'], [50, 2, 0.02, '3'], [4000, 1, 0.01, '3']]
And I'm trying to do 2 things: normalize the data in the first 3 columns, and remove rows that have outliers in the first 3 columns (keeping the 4th column intact, as a string).
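For reference, the array is built more or less like this (just a sketch; because the types are mixed, numpy stores every value as a string, which is why I cast back to float further down):

import numpy as np

# mixed ints/floats/strings all end up stored as strings in one array
a = np.array([[100, 1, 0.01, '5'],
              [50, 2, 0.02, '3'],
              [4000, 1, 0.01, '3']])
print(a.dtype)  # something like dtype('<U32'), i.e. everything is a string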
I already have a function to normalize data, which I took from here: Normalize numpy array columns in python
And I already have a function to remove a complete row when one of its values is an outlier, which I took from here: Removing outliers in each column (and corresponding row)
But that function normalizes all the columns, and I don't want it to affect the last one. So I tried temporarily removing the last column and putting it back afterwards, like this:
temp_col = np.take(a, [3], axis=1)   # set the string column aside
a = np.delete(a, [3], axis=1)        # drop it from the main array
a = a.astype(float)                  # convert the remaining columns to numbers
a = remove_outliers(a, 6)
a = normalize_data(a)
a = np.append(a, temp_col, axis=1)   # won't work
And these are the methods used (taken from the sources linked above):
def normalize_data(a):
    # divide each column by its own maximum
    return a / a.max(axis=0)
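For example, on the first 3 columns of my data (cast to float), this scales each column so its maximum becomes 1; worked out by hand it gives roughly:

b = np.array([[100, 1, 0.01],
              [50, 2, 0.02],
              [4000, 1, 0.01]], dtype=float)
normalize_data(b)
# column maxima are 4000, 2 and 0.02, so the result is approximately:
# [[0.025 , 0.5, 0.5],
#  [0.0125, 1. , 1. ],
#  [1.    , 0.5, 0.5]]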
def remove_outliers(a, m):
    # keep only rows where every column is within m standard deviations of its mean
    mask = np.ones((a.shape[0],), dtype=bool)
    mu, sigma = np.mean(a, axis=0), np.std(a, axis=0, ddof=1)
    for j in range(a.shape[1]):
        col = a[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < m
    return a[mask]
But now the problem is that when I remove the outlier rows, the length of my temporary column no longer matches the number of rows in the array, so I can't append it back.
Does anyone have a solution for this problem? Should I do it the hard way and save the indices of the rows that were removed as outliers, and then remove them from my temp_col as well?
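In case it helps to see what I mean, this is roughly the idea (just a sketch, using a hypothetical outlier_mask helper that returns the boolean row mask instead of the filtered array, so the same mask can be applied to temp_col):

def outlier_mask(a, m):
    # same logic as remove_outliers, but returns the row mask instead of a[mask]
    mask = np.ones((a.shape[0],), dtype=bool)
    mu, sigma = np.mean(a, axis=0), np.std(a, axis=0, ddof=1)
    for j in range(a.shape[1]):
        col = a[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < m
    return mask

mask = outlier_mask(a, 6)
a = normalize_data(a[mask])          # filter and normalize the numeric columns
temp_col = temp_col[mask]            # drop the same rows from the string column
a = np.append(a, temp_col, axis=1)   # row counts match again (result becomes a string array due to mixed dtypes)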
Thank you very much!