Pandas using the previous rank values to filter out current row

Question

As the title states, I am trying to use the rows of the previous rank to filter out rows of the current rank.

Here's an example of my starting df:

import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  2.95
# 4     3  4  5  3.00
# 5     3  2  5  6.00

Here's what I want the output to be:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.0

Basically, I want to remove a row if the previous rank has any row whose x and y are each within +-1 of it AND whose z is within +-0.1 of it.

So for the rows in rank 1, I want ANY row in rank 2 to be removed if it has x in (-1 to 1), y in (-1 to 1), z in (0.9 to 1.1) OR x in (2 to 4), y in (3 to 5), z in (2.9 to 3.1).
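
To make the rule concrete, here is a minimal sketch that checks a single rank 2 row against the rank 1 rows. The names `cols`, `tol`, `prev`, and `row` are only illustrative and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})

cols = ['x', 'y', 'z']   # columns used for the comparison (illustrative name)
tol = [1, 1, 0.1]        # allowed absolute difference for x, y, z

prev = df.loc[df['rank'] == 1, cols]  # all rows of the previous rank
row = df.loc[3, cols]                 # candidate row from rank 2: x=3, y=4, z=2.95

# Absolute difference between the candidate and every previous-rank row
diff = (prev - row).abs()

# The candidate is dropped if, for at least one previous-rank row,
# all three columns are within their tolerance
matches_any = diff.le(tol).all(axis=1).any()
print(matches_any)
```

Here the candidate is within tolerance of the rank 1 row (3, 4, 3.0), so it would be removed; the other rank 2 row (0, 0, 1.2) differs by 0.2 in z from (0, 0, 1.0) and would be kept.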

Shouldn't the last row be kept? The condition on z is not met — mozway, Sep 14 '21 at 14:00

mozway · Accepted Answer · 2021-09-14T14:45:51.963

This is a bit tricky as you need to access the previous group. You can compute the groups using groupby first, then iterate over the groups and perform your check with a custom function:

def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # no previous group exists: flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank - 1)

        # get the absolute difference between each row and every row
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1 for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,0.1]).all(axis=1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
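
For larger frames, the row-wise `apply` could be swapped for NumPy broadcasting. The following is only a sketch of that alternative technique, not the method used above; the function name `flag_rows_near_previous_rank` and the variables `cols` and `tol` are made up for illustration. Like the function above, it compares each row against all rows of the previous rank in the original data, including rows that are themselves flagged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})

cols = ['x', 'y', 'z']
tol = np.array([1, 1, 0.1])  # tolerances for x, y, z


def flag_rows_near_previous_rank(df, cols, tol):
    """Return a boolean Series, True where a row lies within `tol`
    of at least one row of the previous rank (rank - 1)."""
    mask = pd.Series(False, index=df.index)
    grouped = df.groupby('rank')
    # Cache the comparison columns of each rank as float arrays
    values = {rank: g[cols].to_numpy(dtype=float) for rank, g in grouped}
    for rank, g in grouped:
        prev = values.get(rank - 1)
        if prev is None:
            continue  # no previous rank: nothing to drop
        cur = values[rank]
        # diff has shape (n_current, n_previous, n_columns)
        diff = np.abs(cur[:, None, :] - prev[None, :, :])
        # all columns within tolerance, for at least one previous row
        mask.loc[g.index] = (diff <= tol).all(axis=2).any(axis=1)
    return mask


mask = flag_rows_near_previous_rank(df, cols, tol)
result = df[~mask]
print(result)
```

The intermediate array grows with the product of the two group sizes, so this trades memory for speed and may not suit very large groups.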

edited Sep 14 '21 at 14:45

answered Sep 14 '21 at 13:56


Can you explain what you're doing in your function? I am kinda lost lol – mike_gundy123 Sep 14 '21 at 14:32
@mike_gundy123 I commented the code, let me know if you have questions – mozway Sep 14 '21 at 14:45
Okay, thanks for the explanation, it definitely helped! One last question: my real data set has extra columns that are not important for the comparison. What do I do with those in the function? Do I just continue to ignore them in the []? – mike_gundy123 Sep 14 '21 at 14:49
as my code is only computing a mask, extra columns shouldn't affect the process – mozway Sep 14 '21 at 14:50
@mike_gundy123 I provided an answer, but this is almost the same ;) – mozway Sep 21 '21 at 14:10
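
On the question of extra columns raised in the comments: a cautious variant is to select the comparison columns before subtracting, so that extra columns (in particular non-numeric ones, which may not support subtraction) never take part in the arithmetic. This is a sketch under that assumption; the `label` column and the names `cols` and `tol` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
    'label': ['a', 'b', 'c', 'd', 'e', 'f'],  # hypothetical extra column
})

cols = ['x', 'y', 'z']  # only these columns are compared
tol = [1, 1, 0.1]


def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # no previous group: keep every row
        return pd.Series(False, index=d.index)
    # restrict both sides to the comparison columns before subtracting
    d_prev = groups.get_group(rank - 1)[cols]
    return d[cols].apply(
        lambda s: (d_prev - s).abs().le(tol).all(axis=1).any(),
        axis=1,
    )


groups = df.groupby('rank')
mask = pd.concat(
    [check_previous_group(rank, d, groups) for rank, d in groups]
).astype(bool)
filtered = df[~mask]  # extra columns are carried through untouched
print(filtered)
```

Because the mask is indexed like `df`, the filtered frame keeps every original column, including the ones ignored in the comparison.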
