6

This question is very similar to one I posted before with just one change. Instead of doing just the absolute difference for all the columns I also want to find the magnitude difference for the 'Z' column, so if the current Z is 1.1x greater than prev than keep it.

(more context to the problem)

Pandas using the previous rank values to filter out current row

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  3.25
# 4     3  4  5  3.00
# 5     3  2  5  6.00

Here's what I want the output to be

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.00

basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) AND z (<1.1z) to remove it.

So for the rows rank 1 ANY rows in rank 2 that have any combo of x = (-1-1), y = (-1-1), z= (<1.1) OR x = (2-5), y = (3-5), z= (<3.3) I want it to be removed

mike_gundy123
  • 469
  • 5
  • 18

5 Answers5

5

Here's a solution using numpy broadcasting:

# Initially, no row is dropped
df['drop'] = False

for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]] # use numpy broadcasting to ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max        = z * 1.1

    # Find the x, y, z of the next rank. Raise them one dimension
    # so that we can make a comparison matrix again x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]

    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop

# Result
df[~df['drop']]

Condensed

An even more condensed version (and likely faster). This is a really good way to puzzle your future teammates when they read the code:

r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]

drop = (
    (rr == r + 1) &
    (x-1 <= xx) & (xx <= x+1) &
    (y-1 <= yy) & (yy <= y+1) &
    (zz <= z*1.1)
).any(axis=1)

# Result
df[~drop]

What this does is comparing every row in df against each other (including itself) and return True (i.e. drop) if:

  • The current row's rank == the other row's rank + 1; and
  • The current row's x, y, z fall within the specified range of the other row's x, y, z
Code Different
  • 90,614
  • 16
  • 144
  • 163
  • Thank you for your input! You showed me a new function I've never seen before. I went with another answer since it was most similar to what I already had – mike_gundy123 Sep 27 '21 at 13:03
2

You need to slightly modify my previous code:

def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check is a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one rows of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*s['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • 1
    Thanks for replying again! but this doesn't exactly work as I image in it. If were to change index 3 `z: 3.31` it wouldn't show up in the output even though 3.31 > 3.00*1.1 – mike_gundy123 Sep 21 '21 at 14:25
1

Just takes an adjustment to the z term of the lamda equation from the linked post:

return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

Here's the full code that works for me:

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})


def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check is a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one rows of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
cmay
  • 153
  • 6
  • Not working for me, it says `ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` – mike_gundy123 Sep 21 '21 at 13:08
  • okay I think I fixed it, instead of doing `.1*d_prev['z'] I had to do .1*s['z']` – mike_gundy123 Sep 21 '21 at 13:46
  • You might want to check that, I believe that will do 10% of the second z value (which is denoted by "s") as compared to the previous (which is denoted by d_prev). I believe just taking the other answer's code and using d_prev as I've done should give the 10% from the previous line. – cmay Sep 21 '21 at 14:41
  • hmm ya you're right, but `d_prev` is considered an entire df, where as s (from my understanding) is each row in `d` so i can't exactly do `d_prev[['z']]`. Hope that made sense – mike_gundy123 Sep 21 '21 at 15:24
  • Just added the full implementation that works for me with 3.3 and 3.31 excluded and included – cmay Sep 21 '21 at 16:27
  • 1
    Maybe I am just being dense, but this doesn't work for me (I copied and pasted your code). I get an error that says `ValueError: Unable to coerce list of to Series/DataFrame`. Which I presume comes from here: `le([1,1,.1*d_prev['z']])` so I changed it to: `le([1,1,.1*d_prev[['z']]])`. Which gives me the error that I mentioned in my first reply to you. – mike_gundy123 Sep 21 '21 at 16:50
1

This works for me on Python 3.8.6

import pandas as pd

dfg = df.groupby("rank")

def filter_func(dfg):
    for g in dfg.groups.keys():
        if g-1 in dfg.groups.keys():
            yield (
                pd.merge(
                    dfg.get_group(g).assign(id = lambda df: df.index), 
                    dfg.get_group(g-1),
                    how="cross", suffixes=("", "_prev")
                ).assign(
                    cond = lambda df: ~(
                        (df.x - df.x_prev).abs().le(1) & (df.y - df.y_prev).abs().le(1) & df.z.divide(df.z_prev).lt(1.1)
                    )
                )
            ).groupby("id").agg(
                {
                    **{"cond": "all"},
                    **{k: "first" for k in df.columns}
                }).loc[lambda df: df.cond].drop(columns = ["cond"])
        else:
            yield dfg.get_group(g)

pd.concat(
    filter_func(dfg), ignore_index=True
)

The output seems to match what you expected:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     3  2  5  6.0

Small edit: in your question it seems like you care about the row index. The solution I posted just ignores this, but if you want to keep it, just save it as an additional column in the dataframe.

Shffl
  • 396
  • 3
  • 18
1

I have modified mozway's function so that it works according to your requirements.

# comparing 'equal' float values, may go wrong, that's why I am using this constant
DELTA=0.1**12

def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        #return pd.Series(False, index=d.index)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if differences in x and y are within 1 and z < 1.1*x
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        
        return d.apply(lambda s: (abs(d_prev-s)[['x', 'y']].le([1,1]).all(1)&
                                  (s['z']<1.1*d_prev['x']-DELTA)).any(), axis=1)

tests,

>>> df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})

>>> df

   rank  x  y     z
0     1  0  0  1.00
1     1  3  4  3.00
2     2  0  0  1.20
3     2  3  4  3.25
4     3  4  5  3.00
5     3  2  5  6.00

>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
>>> df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3, 6],
})

>>> df

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
4     3  4  5  3.0
5     3  2  5  6.0


>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
5     3  2  5  6.0
  • This works perfectly! Also can you explain more about the reason behind using the `DELTA` variable? (also I presume you meant to do ` (s['z']<1.1*d_prev['z']` instead of `(s['z']<1.1*d_prev['x']`) – mike_gundy123 Sep 27 '21 at 12:59
  • @mike_gundy123, thank you for appreciation! I am using DELTA because, floating point values are always approximate, so 1.1*3 ≠ 3.3 exactly, it might be 1.1*3=3.299999999999999, hence the confusion. Anyway please see this answer there you will find much more better explanation of this phenomenon, https://stackoverflow.com/questions/5595425/what-is-the-best-way-to-compare-floats-for-almost-equality-in-python – Arislan Makhmudov Sep 27 '21 at 13:25
  • The second closing parenthesis in (s['z']<1.1*d_prev['x']-DELTA)) is overall wrapping one, please note after it I am applying method 'any()'. And the one next to last is used to calculate arithmetic operation within. So there's no typo there. – Arislan Makhmudov Sep 27 '21 at 13:29
  • oh no not that, I am saying you're comparing 'Z' and 'X' on that line when you should be comparing 'Z' and 'Z' – mike_gundy123 Sep 27 '21 at 13:32
  • @mike_gundy123 but aren't you requiring it? I am citing your question: ------- basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) AND z (<1.1x) to remove it. -------- z (<1.1x) here z is compared to 1.1*x? – Arislan Makhmudov Sep 27 '21 at 13:35
  • Anyway, you can change x to z, and I believe the code will work fine. – Arislan Makhmudov Sep 27 '21 at 13:36
  • lmaoooooo, that is my bad on my part – mike_gundy123 Sep 27 '21 at 13:38
  • we're all human after all :) – Arislan Makhmudov Sep 27 '21 at 13:41