3

I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based upon some simple thresholds:

  • Value transformed to 0 if the point is NaN
  • Value transformed to 1 if the point is negative or 0
  • Value transformed to 2 if it falls outside certain criteria based on the entire column
  • Value is 3 otherwise

Here is a very simple self-contained example:

import pandas as pd
import numpy as np

df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]})

print(df)

enter image description here

The transformation process created so far:

#Loop through and find points greater than the mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')

outliers[outliers.isnull() == False] = 2


#Classify everything else as "3"
df[df > 0] = 3

#Classify negative and zero points as a "1"
df[df <= 0] = 1

#Update with the outliers
df.update(outliers)

#Everything else is a "0"
df.fillna(value=0, inplace=True)

Resulting in:

enter image description here

I have tried to use .applymap() and/or .groupby() in order to speed up the process with no luck. I found some guidance in this answer however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.

Community
  • 1
  • 1
Clayton
  • 1,525
  • 5
  • 19
  • 35

1 Answers1

4

Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.

>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )

    a   b
0 NaN   2
1   2   3
2   3  -4
3   4   5
4   5   6
5   0   2
6  -7   7
7   9   9
8  10 NaN

You could also do it with apply, but it will be slower than the np.where approach (but approximately the same speed as what you are currently doing), though much simpler. That's probably a good example of why you should always avoid apply if possible, when you care about speed.

>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2

You could also do this, which is faster than apply but slower than np.where:

>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2

Of course, these things don't always scale linearly, so test them on your real data and see how that compares.

JohnE
  • 29,156
  • 8
  • 79
  • 109
  • For the outlier part, I only want values to be replaced with `2` when they meet the conditional statement only for their column, **not the whole dataframe** -- I think your solutions use the whole dataframe? – Clayton Jun 23 '15 at 19:22
  • 1
    @cmiller8 No, it's per column. Type `df.mean()` and you will see that it gives you a mean for each column. You could also try some different sample data to test it. – JohnE Jun 23 '15 at 21:32
  • you're right! AND your method is 300x faster a 10k column, 25k row dataframe – Clayton Jun 23 '15 at 22:36