I have a dataframe with a length of 1168 and 270 columns.

The goal is the title: Remove outliers from all 270 columns that are more than +/- 3 standard deviations away from the mean.

My code is the following. However, it only keeps 40 data points. This doesn't make sense, since the dataframe originally has 1168 rows, which means it's only keeping about 3% of the entire dataset.

import numpy as np
from scipy import stats
len(df[(np.abs(stats.zscore(df)) < 3).all(axis=1)])
Katsu

1 Answer

I think I can tell you what's wrong, at least: .all(axis=1) reduces across the columns and returns one boolean per row, which is True only if every column in that row passes the test. So 40 is the number of rows in which all 270 columns lie within ±3 std; a row is dropped as soon as a single one of its columns holds an outlier. See docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html
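For reference, here is a small check of what the axis argument does. The boolean frame is made up and only stands in for the result of the z-score comparison; the names mask, per_row and per_col are illustrative.

```python
import pandas as pd

# Hypothetical 3x2 boolean frame standing in for `np.abs(stats.zscore(df)) < 3`
mask = pd.DataFrame({"a": [True, True, False], "b": [True, False, False]})

per_row = mask.all(axis=1)  # one value per row: True only if every column is True
per_col = mask.all(axis=0)  # one value per column: True only if every row is True

print(per_row.tolist())  # [True, False, False]
print(per_col.tolist())  # [False, False]
```

The result of .all(axis=1) has one entry per row, so using it as a row filter keeps only rows with no outlier in any column.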

I think this should work:

df[np.abs(stats.zscore(df)) < 3].count()
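A rough sketch of the two options side by side, on random data that only imitates the shape in the question (the real df is not available here). The z-scores are computed with plain pandas and ddof=0, which is assumed to match the default of scipy.stats.zscore; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 1168 x 270 frame in the question
df = pd.DataFrame(rng.standard_normal((1168, 270)))

# Column-wise z-scores; ddof=0 is the population standard deviation
z = (df - df.mean()) / df.std(ddof=0)
within = z.abs() < 3

rows_kept = df[within.all(axis=1)]  # drops a row if ANY of its columns is an outlier
masked = df.where(within)           # keeps every row; outlier cells become NaN

n_rows_kept = len(rows_kept)
n_outlier_cells = int((~within).sum().sum())
print(n_rows_kept, n_outlier_cells)
```

With 270 columns, even a small per-column outlier rate removes many rows under the first approach, while the second keeps all rows and leaves NaN where the outliers were.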
  • That looks like it just keeps all rows (len = 1168). I even flipped the sign to > 3 to see if it was 0, but without the all it just keeps all rows still (len = 1168) – Katsu Feb 01 '23 at 22:29
  • Change len to sum? Or sum(sum(...)). https://stackoverflow.com/questions/12765833/counting-the-number-of-true-booleans-in-a-python-list – user9605929 Feb 02 '23 at 04:09
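A toy illustration of why len stays the same in the comment above while sum does not, assuming pandas treats a 2D boolean indexer like DataFrame.where. The frame and the threshold of 40 are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame; 50.0 is the only value that fails the condition
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 10.0, 50.0]})
mask = df < 40  # stand-in for the z-score condition

print(len(df[mask]))                # 4: shape is kept, failing cells become NaN
print(len(df[~mask]))               # 4: same shape with the condition flipped
print(int(mask.sum().sum()))        # 7: number of cells that pass
print(int(mask.all(axis=1).sum()))  # 3: rows in which every column passes
```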