
I have a pandas DataFrame with a 'data' and a 'cond'(-ition) column. I need the mean value (of the 'data' column) of the rows in the longest run of consecutive True values in 'cond'.

    Example DataFrame:

        cond  data
    0   True  0.20
    1  False  0.30
    2   True  0.90
    3   True  1.20
    4   True  2.30
    5  False  0.75
    6   True  0.80

    Result = 1.4667, the mean of the 'data' values in rows 2 to 4, which form the longest run of True (3 in a row)

I was not able to find a "vectorized" solution with a groupby or pivot method, so I wrote a function that loops over the rows. Unfortunately this takes about an hour for 1 million rows, which is way too long, and the @jit decorator does not reduce the duration measurably.

The data I want to analyze comes from a one-year monitoring project; every 3 hours I get a DataFrame with one million rows, so there are about 3000 such files.

An efficient solution is therefore very important. I would also be grateful for a solution in NumPy.
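
For reference, the example DataFrame above can be built like this (a minimal setup snippet, using only the values shown above):

import pandas as pd

df = pd.DataFrame({'cond': [True, False, True, True, True, False, True],
                   'data': [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80]})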

– Stivi B

2 Answers


Using the approach from Calculating the number of specific consecutive equal values in a vectorized way in pandas:

df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667

Indexing by a callable requires pandas 0.18.0; for earlier versions, you can do:

res = df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])

res[res['count'] == res['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667

How it works:

The first part, df['cond'] != df['cond'].shift(), returns a boolean Series:

df['cond'] != df['cond'].shift()
Out: 
0     True
1     True
2     True
3    False
4    False
5     True
6     True
Name: cond, dtype: bool

So the value is False whenever a row is the same as the one above it. That means that if you take the cumulative sum, these rows (consecutive equal ones) will have the same number:

(df['cond'] != df['cond'].shift()).cumsum()
Out: 
0    1
1    2
2    3
3    3
4    3
5    4
6    5
Name: cond, dtype: int32

Since groupby accepts any Series to group on (it is not necessary to pass a column; you can pass an arbitrary list or Series of the same length), this can be used to group the results. The .agg(['count', 'mean']) part just gives the respective count and mean for each group, and at the end the group with the highest count is selected.
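
As a tiny illustration (made-up data, not part of the original answer), any list or Series of matching length works as a grouping key:

import pandas as pd

s = pd.Series([10, 20, 30, 40])
key = [1, 1, 2, 2]   # an arbitrary list, not a column of any DataFrame

s.groupby(key).agg(['count', 'mean'])
Out: 
   count  mean
1      2  15.0
2      2  35.0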

Note that this would group consecutive False values together, too. If you only want to consider runs of consecutive True values, you can change the grouping Series to:

((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()

We want the indicator to stay False (i.e. keep the same group) only inside runs of True, so the condition becomes 'not equal to the row above OR not True'. The original line then changes to:

df['data'].groupby(((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
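To see the difference, here is a quick sketch on a small made-up Series containing two consecutive False values (not the question's data):

s = pd.Series([False, False, True, True, False])

# plain grouper: both leading Falses fall into group 1
(s != s.shift()).cumsum().tolist()
Out: [1, 1, 2, 2, 3]

# modified grouper: every False row opens its own group, so runs of False never win
((s != s.shift()) | (s != True)).cumsum().tolist()
Out: [1, 2, 3, 3, 4]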

– ayhan

Here's a NumPy based approach -

import numpy as np

# Extract the relevant cond column as a 1D NumPy array and pad it with False at
# either end, as later on we will look for the start (rising edge)
# and stop (falling edge) of each interval of True values
arr = np.concatenate(([False],df.cond.values,[False]))

# Determine the rising and falling edges as start and stop 
start = np.nonzero(arr[1:] > arr[:-1])[0]
stop = np.nonzero(arr[1:] < arr[:-1])[0]

# Get the interval lengths and determine the largest interval ID
maxID = (stop - start).argmax()

# Use maxID to slice out the longest True interval and take the mean of the 'data' column
out = df.data.iloc[start[maxID]:stop[maxID]].mean()
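
On the question's example DataFrame, the intermediate values work out as follows (traced by hand, shown as comments):

arr           # [False, True, False, True, True, True, False, True, False]
start         # array([0, 2, 6])  -> rows where a run of True begins
stop          # array([1, 5, 7])  -> first row after each run of True ends
stop - start  # array([1, 3, 1])  -> run lengths, so maxID = 1
out           # df.data.iloc[2:5].mean() = 1.4666...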

Runtime test

Approaches as functions -

def pandas_based(df): # @ayhan's soln
    res = df['data'].groupby((df['cond'] != df['cond'].shift()).\
                                cumsum()).agg(['count', 'mean'])
    return res[res['count'] == res['count'].max()]

def numpy_based(df):
    arr = np.concatenate(([False],df.cond.values,[False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    maxID = (stop - start).argmax()
    return df.data.iloc[start[maxID]:stop[maxID]].mean()

Timings -

In [208]: # Setup dataframe
     ...: N = 1000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [209]: %timeit pandas_based(df)
100 loops, best of 3: 2.61 ms per loop

In [210]: %timeit numpy_based(df)
1000 loops, best of 3: 215 µs per loop

In [211]: # Setup dataframe
     ...: N = 10000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [212]: %timeit pandas_based(df)
100 loops, best of 3: 4.12 ms per loop

In [213]: %timeit numpy_based(df)
1000 loops, best of 3: 331 µs per loop

– Divakar
  • The tests I ran are all correct. I am trying to understand what you wrote. Thanks a lot. – Stivi B Oct 16 '16 at 21:18
  • @StiviB Added a few comments there. – Divakar Oct 16 '16 at 21:25
  • Very fast solution. I had to update NumPy and now it works perfectly. I perform most of my calculations with pandas DataFrames; I wonder if I should use NumPy more in general. I will check this. Thanks. – Stivi B Oct 18 '16 at 21:05
  • @StiviB Not an expert on pandas, but I often answer pandas-tagged questions with NumPy, and with that little experience I have felt that for number crunching, especially for reductions, NumPy works well. So, maybe keep that in mind! – Divakar Oct 18 '16 at 21:07