0

As the title states I am trying to use the previous rank to filter out the current

Here's an example of my starting df:

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  2.95
# 4     3  4  5  3.00
# 5     3  2  5  6.00

Here's what I want the output to be:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.00

Basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) AND z (+- .1) to remove it.

So for the rows rank 1 ANY rows in rank 2 that have any combo of x = (-1-1), y = (-1-1), z= (.9-1.1) OR x = (2-5), y = (3-5), z= (2.9-3.1) I want it to be removed.

Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
mike_gundy123
  • 469
  • 5
  • 18

1 Answers1

2

This is a bit tricky as your need to access the previous group. You can compute the groups using groupby first, and then iterate over the elements and perform your check with a custom function:

def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check is a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d1.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1 for x/y/z
        # for at least one rows of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,0.1]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
mozway
  • 194,879
  • 13
  • 39
  • 75