I'm trying to remove outliers from a dataset. In order to do that, I'm using:

df = df[df.attr < df.attr.mean() + df.attr.std()*3]

That seems to work as expected, but, when I do something like:

for i in range(df.shape[0]):
    print(df.attr[i])

Then I get a KeyError. Seems like Pandas isn't actually returning a new DataFrame with rows dropped. How do I actually remove those rows, and get a fully functional DataFrame back?

MaiaVictor

2 Answers

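I think you need positional indexing. Boolean indexing does return a new DataFrame, but the remaining rows keep their original index labels, so df.attr[i] raises a KeyError for any label that was filtered out. Select by position with DataFrame.iloc:

for i in range(df.shape[0]):
    print(df.iloc[i, df.columns.get_loc('attr')])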

Or Series.iloc:

for i in range(df.shape[0]):
    print(df.attr.iloc[i])

A simpler solution is Series.items (Series.iteritems in older pandas versions):

for i, val in df.attr.items():
    print(val)
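
To make the cause of the KeyError concrete, here is a minimal sketch using made-up data (the 20-row frame and the variable names `filtered`, `by_position`, and `by_items` are assumptions for illustration, not from the question). It shows that the filtered frame keeps the original labels, and that positional access and iteration both work anyway:

```python
import pandas as pd

# Hypothetical data: 19 ordinary values and one extreme value at label 3.
df = pd.DataFrame({"attr": [1.0] * 3 + [100.0] + [1.0] * 16})

# Boolean indexing returns a new DataFrame, but the surviving rows keep
# their original labels, so label 3 is now missing from the index.
filtered = df[df.attr < df.attr.mean() + df.attr.std() * 3]
print(filtered.index.tolist())

# Label-based lookup of the missing label raises KeyError.
try:
    filtered.attr[3]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup works regardless of the labels.
by_position = [filtered.attr.iloc[i] for i in range(filtered.shape[0])]

# Iterating over (label, value) pairs also avoids the problem.
by_items = [val for _, val in filtered.attr.items()]
```

The loop in the question fails only when a removed row sits before the end of the frame, which is why the outlier is placed at label 3 here.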
jezrael
  • I'm tempted to accept your answer since it is actually the best solution in my case, but someone Googling those keywords might actually need to drop the rows (for different reasons), so I'll accept the other one. – MaiaVictor Nov 12 '16 at 23:17
  • I am a bit surprised; I think [`boolean indexing`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) is better than drop, but it is up to you. Good luck :) – jezrael Nov 13 '16 at 08:32

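First, find the rows that fail the criterion you want to keep (in your case the criterion is df.attr < df.attr.mean() + df.attr.std()*3, so the rows to remove are those where it is False).

x = ~(df.loc[:, 'attr'] < df.attr.mean() + df.attr.std()*3)

Next, use DataFrame.drop. It returns a new DataFrame, so assign the result back.

df = df.drop(x[x].index)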

See answers such as "How to drop a list of rows from Pandas dataframe?" for more information.
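
A minimal sketch of dropping the outlier rows, again on made-up data (the frame and the names `is_outlier`, `cleaned`, and `values` are assumptions for illustration). The mask here selects the rows to remove. The `reset_index(drop=True)` call is an addition that is not part of the steps above: dropping rows alone leaves gaps in the index labels, and renumbering them is what would let the loop from the question run unchanged.

```python
import pandas as pd

# Hypothetical data: 19 ordinary values and one extreme value at label 3.
df = pd.DataFrame({"attr": [1.0] * 3 + [100.0] + [1.0] * 16})

# Boolean mask that is True for the rows to remove (the outliers).
is_outlier = df["attr"] >= df["attr"].mean() + df["attr"].std() * 3

# drop returns a new DataFrame without the masked labels;
# reset_index(drop=True) renumbers the remaining rows 0..n-1.
cleaned = df.drop(is_outlier[is_outlier].index).reset_index(drop=True)

# Label-based access by 0..n-1 now works, as in the question's loop.
values = [cleaned.attr[i] for i in range(cleaned.shape[0])]
```

The same renumbering can be applied after boolean indexing, so the choice between the two approaches is mostly a matter of taste.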

wwl