
I'd like to extract the bins for each 'S' individually, where both columns ('X' and 'Y') are > 0.5 for a single bin, or where the column sums over multiple consecutive bins are > 0.5 * (number of rows combined).

In the example:

for 'AR1', only bin 4 should be selected, because 'X' and 'Y' are > 0.5 (indicated in blue);

for 'PO1', bins 1, 2, 3 and 4 should be selected, because the sums of 'X' and 'Y' are > (4 * 0.5) (indicated in yellow).

I've tried this earlier with a for loop, but that didn't work correctly: Selecting multiple (neighboring) rows conditionally

import numpy as np
import pandas as pd

np.random.seed(0)

N = 20
S = ['AR1', 'PO1']

df = pd.DataFrame(
    {'X':np.random.uniform(-1,1,N),
     'Y':np.random.uniform(-1,1,N),
     'S':np.random.choice(S,N),
    })

df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=5, labels=np.arange(5))    # create 5 quantile bins of 'X' per 'S' group

def func(df):                                                                   # group per 'S' and their bins
    df1 = df.groupby(['S', 'bins_X']).sum()
    new_cols = list(zip(df1.columns.get_level_values(0)))                       # rebuild columns as a 1-level MultiIndex
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

print(func(df))

[image: output of func(df), with the qualifying bins for 'AR1' highlighted in blue and for 'PO1' in yellow]

EDIT

What it should look like is the df as shown in the question, but with the rows that do not qualify filtered out. What I check for is this: the values in X and Y must be > 0.5 for any row (bin), either separately or combined. Rows may only be combined consecutively: 2, 3, 4 or 5 rows together.

I.e., the combinations of rows starting at 0 are: 0+1, 0+1+2, 0+1+2+3 and 0+1+2+3+4. For 1: 1+2, 1+2+3 and 1+2+3+4, etc.

For multiple rows the threshold scales with the number of rows: the sums of X and Y would each have to be > 2.5 (5 rows * 0.5) for rows 0 to 4 combined, for example.
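To make the condition concrete, here is a deliberately slow brute-force version of the check I mean (illustrative only; qualifying_runs is just a made-up name):

def qualifying_runs(group, cols=('X', 'Y'), threshold=0.5):
    # return (start, stop) position pairs of consecutive bins whose
    # column sums all exceed threshold * (number of rows in the run)
    runs = []
    n = len(group)
    for start in range(n):
        for stop in range(start + 1, n + 1):
            window = group.iloc[start:stop]
            if (window[list(cols)].sum() > threshold * (stop - start)).all():
                runs.append((start, stop))
    return runs

The desired output would keep the rows of the longest such run per 'S' group.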

EDIT2: @JohnE and piRSquared, both your solutions work; which one, however, would work better when there are other columns in the dataframe that should not be evaluated?

In addition, what if I wanted to add an additional condition to your solutions?

EDIT3: @piRSquared, when subsetting some columns I only get those columns returned, whereas I need all of them, not only the subsetted ones.

Could you assist? Thanks.


1 Answer


This is a vectorized approach with only one loop at the top level (groupby.apply).

# columns that I care about
cols = ['X', 'Y']
df1.groupby(level=0)[cols].apply(find_window)   # find_window is defined below

[image: result of the groupby/apply, the qualifying window per 'S' group]


def find_window(df):
    v = df.values
    # cumulative sums with a row of zeros prepended, so that the sum of
    # rows i0..i1-1 is s[i1] - s[i0]
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])

    threshold = .5

    # every (start, end) pair of positions with start < end
    r, c = np.triu_indices(s.shape[0], 1)
    d = (c - r)[:, None]                 # window lengths
    e = s[c] - s[r]                      # window sums
    mask = (e / d > threshold).all(1)    # windows whose means beat the threshold in every column
    rng = np.arange(mask.shape[0])

    if mask.any():
        # among qualifying windows, take the one spanning the most rows
        idx = rng[mask][d[mask].argmax()]

        i0, i1 = r[idx], c[idx]
        return pd.DataFrame(
            v[i0:i1],
            df.loc[df.name].index[i0:i1],    # bin labels of this group
            df.columns
        )
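On the subsetting issue from EDIT3: df1.groupby(level=0)[cols].apply(find_window) only ever sees, and therefore only returns, the subset columns. One possible variant (a sketch, untested on your real data; find_window_keep_all is my own name) evaluates the condition on a subset of columns but slices the full group, so every column comes back:

def find_window_keep_all(df, cols=('X', 'Y'), threshold=.5):
    # evaluate the window condition on the subset columns only
    v = df[list(cols)].values
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])
    r, c = np.triu_indices(s.shape[0], 1)
    d = (c - r)[:, None]
    mask = ((s[c] - s[r]) / d > threshold).all(1)
    if mask.any():
        idx = np.arange(mask.shape[0])[mask][d[mask].argmax()]
        # slice the full group, keeping the columns that were not evaluated
        return df.iloc[r[idx]:c[idx]]

df1.groupby(level=0).apply(find_window_keep_all)   # note: no [cols] subset here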

Explanation

strategy

  • numpy.triu_indices: I need to evaluate each possible window for the rolling mean greater than some threshold. I'm going to capture each possible window by its start and end positions: 0 to 1, then 0 to 2, and so on, then 1 to 2, 1 to 3 ... where the start always comes before the end. I can access these combinations with numpy.triu_indices.
  • cumsum: It would be a little tricky (doable) to get the expanded arrays specified by each combination of indices that I get from np.triu_indices. A better way is to calculate the cumsum and take the difference from one index to the next (see the sketch after this list).
  • I have to prepend zeros to my cumsum so that I can take the difference for the first row.
  • But sums are not means. I need to divide by the number of rows to get the means. Conveniently, the difference between the end and start positions is exactly the number of rows and thus the appropriate number to divide the sums by in order to calculate the means.
  • Now that I have the means, e / d, I check which are > threshold and identify which combinations of start and end positions have means greater than the threshold for both columns.
  • I then identify the combination with the greatest number of rows among those that have means greater than the threshold.
  • I unwind the positions and reconstruct a dataframe.
  • groupby and apply... QED
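
A small, self-contained illustration of the cumsum/triu_indices steps on toy data (the numbers are arbitrary, chosen just for this example):

import numpy as np

a = np.array([[0.9, 0.8],
              [0.2, 0.1],
              [0.7, 0.9]])

# cumulative sums with a zero row prepended: the sum of rows i..j-1 is s[j] - s[i]
s = np.vstack([np.zeros((1, a.shape[1])), a.cumsum(0)])

# all (start, end) position pairs with start < end
r, c = np.triu_indices(s.shape[0], 1)

d = (c - r)[:, None]          # number of rows in each window
means = (s[c] - s[r]) / d     # per-column mean of each window

# the window covering all three rows (start=0, end=3) reproduces a.mean(0)
print(means[(r == 0) & (c == 3)])   # [[0.6 0.6]]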

time test

[image: timing comparison against JohnE's (deleted) solution]


with more data

np.random.seed(0)

N = 300
S = ['AR1', 'PO1', 'AR2', 'PO2', 'AR3', 'PO3']

df = pd.DataFrame(
    {'X':np.random.uniform(-1,1,N),
     'Y':np.random.uniform(-1,1,N),
     'S':np.random.choice(S,N),
    })

df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=20, labels=np.arange(20))    # create 20 quantile bins of 'X' per 'S' group

def func(df):                                                                     # group per 'S' and their bins
    df1 = df.groupby(['S', 'bins_X']).sum()
    new_cols = list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

df1 = func(df)

The time difference is even more dramatic:

[image: timing comparison with more data]
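
For reference, a minimal way to reproduce a timing on this larger set (assuming an IPython/Jupyter session, where %timeit is available):

%timeit df1.groupby(level=0)[cols].apply(find_window)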

  • Thanks, this works. I've put in an edit. I need to use either your solution or JohnE's on my real data set. This set however has a few additional hurdles for me. – Zanshin Jan 28 '17 at 10:23
  • Nice! I erased mine as I knew it was quite slow and you included my code in the timings anyway, so if anyone cares it is there. I had a feeling that this would not be easy to do in numpy, so I'm weirdly happy to see it was not easy! – JohnE Jan 28 '17 at 13:12
  • I can appreciate the sentiment ;-) – piRSquared Jan 28 '17 at 13:13
  • @Zanshin I've updated my post to accommodate subsetting. – piRSquared Jan 28 '17 at 15:32
  • @piRSquared, if I want to evaluate an additional column 'total' with a different threshold, how would I amend your solution? 'total' would be 'X' + 'Y'. – Zanshin Jan 29 '17 at 10:41
  • @piRSquared, I've tried your solution. I've put in an edit. – Zanshin Jan 30 '17 at 04:20
  • @Zanshin I've rerun my code with varying conditions and I can't replicate your error. I cannot guess what your issue is, and I cannot continue to help without you providing adequate data to reproduce the issues you are having. – piRSquared Jan 30 '17 at 04:46
  • I understand. Could you maybe add a short explanation per code line, so I can maybe solve/change it myself? – Zanshin Jan 30 '17 at 10:03
  • @piRSquared, I've added a new question: http://stackoverflow.com/questions/42094873/incorrect-results-when-applying-solution-to-real-data. Would you mind taking a look? Thanks. – Zanshin Feb 08 '17 at 11:12