For each row, I'm trying to compute the standard deviation of the other values in its group, excluding the row's own value. A way to think about it is "what would the standard deviation for the group be if this row's value were removed". An example may be easier to parse:

import polars as pl

df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
┌────────┬───────┐
│ answer ┆ value │
│ ---    ┆ ---   │
│ str    ┆ i64   │
╞════════╪═══════╡
│ yes    ┆ 5     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 10    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 7     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 8     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 6     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 9     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 10    │
└────────┴───────┘

I would want to add a column where the first row would be std([10, 7, 8]) = 1.527525, i.e. the standard deviation of the other "yes" values.

I tried to hack something together and ended up with code that is horrible to read and also has a bug that I don't know how to work around:

df.with_column(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_column(
    (
        (
            (
                (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
                - (pl.col("value") - pl.col("average_other")).pow(2)
            )
            / (pl.col("value").count().over(pl.col("answer")) - 1)
        ).sqrt()
    ).alias("std_dev_other")
)

I'm not sure I would recommend parsing that, but I'll point out at least one thing that is wrong: in (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer")) I want to compare each "value" in the window to *this* row's "average_other", then square and sum over the window. Instead, each "value" is being compared to its own row's "average_other".

My main question is the "what is the best way to get the standard deviation while leaving out this value?" part. But I would also be interested in whether there is a way to do the comparison that I'm getting wrong above. Third would be tips on how to write this in a way that makes it easy to understand what is going on.
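In case the description above is ambiguous, here is a brute-force plain-Python version of what I'm trying to vectorize (using statistics.stdev, i.e. the sample standard deviation, to match the 1.527525 above; the function name is just for illustration):

```python
import statistics

def std_of_others(values):
    # For each element, compute the sample standard deviation
    # of the remaining elements in the list.
    return [
        statistics.stdev(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]

std_of_others([5, 10, 7, 8])
# ≈ [1.527525, 1.527525, 2.516611, 2.516611]
```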

pwb2103

3 Answers


The way I'd come at this (at least at first thought) is to create three helper columns: a row index, a windowed list of the values in the group, and a windowed list of the row indices in the group. Next, explode by the two list columns, then filter out the rows where the row index equals the exploded index list value; that drops each row's own value from its window. Now you can group by the row index, run std over the remaining values, and join the result back to the original.

df = pl.DataFrame(
    {
        "answer": ["yes","yes","yes","yes","maybe","maybe","maybe"],
        "value": [5,10,7,8,6,9,10],
    }
)

df.with_row_count('i').join(
    df.with_row_count('i')
        .with_columns([
            # .list() collects the window into a list column
            # (renamed to .implode() in later Polars versions)
            pl.col('value').list().over('answer').alias('l'),
            pl.col('i').list().over('answer').alias('il'),
        ])
        .explode(['l', 'il'])
        .filter(pl.col('i') != pl.col('il'))
        .groupby('i')
        .agg(pl.col('l').std().alias('std')),
    on='i',
).drop('i')

shape: (7, 3)
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ std      │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ i64   ┆ f64      │
╞════════╪═══════╪══════════╡
│ yes    ┆ 5     ┆ 1.527525 │
│ yes    ┆ 10    ┆ 1.527525 │ 
│ yes    ┆ 7     ┆ 2.516611 │
│ yes    ┆ 8     ┆ 2.516611 │
│ maybe  ┆ 6     ┆ 0.707107 │
│ maybe  ┆ 9     ┆ 2.828427 │
│ maybe  ┆ 10    ┆ 2.12132  │
└────────┴───────┴──────────┘
Dean MacGregor

I came up with something similar to @DeanMacGregor's answer:

df = (
    df.with_row_count()
    .join(df.with_row_count(), on="answer")
    .filter(pl.col("row_nr") != pl.col("row_nr_right"))
    .groupby(["answer", "row_nr_right"], maintain_order=True).agg([
        pl.col("value_right").first().alias("value"),
        pl.col("value").std().alias("stdev"),
    ])
    .drop("row_nr_right")
)

.join df (with a row count) on itself over answer and remove the rows where the two row counts are identical. Then group by answer and row_nr_right and (1) pick the first item out of value_right and (2) calculate the standard deviation over the value group.

Result for

df = pl.DataFrame({
    "answer": ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"],
    "value": [5, 10, 7, 8, 4, 6, 9, 10, 4],
})

is

┌────────┬───────┬──────────┐
│ answer ┆ value ┆ stdev    │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ i64   ┆ f64      │
╞════════╪═══════╪══════════╡
│ yes    ┆ 5     ┆ 2.5      │
│ yes    ┆ 10    ┆ 1.825742 │
│ yes    ┆ 7     ┆ 2.753785 │
│ yes    ┆ 8     ┆ 2.645751 │
│ ...    ┆ ...   ┆ ...      │
│ maybe  ┆ 6     ┆ 3.21455  │
│ maybe  ┆ 9     ┆ 3.05505  │
│ maybe  ┆ 10    ┆ 2.516611 │
│ maybe  ┆ 4     ┆ 2.081666 │
└────────┴───────┴──────────┘
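As a quick sanity check on the first output row: the "yes" group with its first value removed is [10, 7, 8, 4], and its sample standard deviation comes out to exactly 2.5:

```python
import statistics

# The "yes" group without the first row's own value:
statistics.stdev([10, 7, 8, 4])  # -> 2.5
```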
Timus

I am not sure if my solution retains the (performance) advantages of using Polars (instead of regular Pandas), but I find it easier to maintain and more readable than the other answers. With all the iterative conversion of data from Series to list I expect this won't scale well, but perhaps your use case does not require that.

Starting with the data (thanks to this answer for a minimal working dataset):

import polars as pl
import statistics as stats

df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df
shape: (7, 2)
┌────────┬───────┐
│ answer ┆ value │
│ ---    ┆ ---   │
│ str    ┆ i64   │
╞════════╪═══════╡
│ yes    ┆ 5     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 10    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 7     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes    ┆ 8     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 6     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 9     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe  ┆ 10    │
└────────┴───────┘

Now define a custom function to do the actual calculation (thanks to this answer for an example of a custom Polars aggregation function). That is: take a list of values, move through it one element at a time, and calculate the standard deviation of all values in the list except the current one:

def custom_std(args: list[pl.Series]) -> pl.Series:
    output = []
    # Iterate over the values within the group
    for idx in range(len(args[0])):
        # Convert the Series to a list, as Polars does not have a method that can delete individual elements
        temp = args[0].to_list()
        # Delete the value of the current row
        del temp[idx]
        # Now calculate the std. dev. on the remaining values
        # I arbitrarily chose the sample standard deviation, adjust according to your situation
        result = stats.stdev(temp)
        # Store the result in a list and go to the next iteration
        output.append(result)
    # The dtype is not correct, but I can't find how to specify that this series contains a list of floats
    return pl.Series(output, dtype=pl.Float64)

Use this function in a groupby:

gdf = df.groupby(by=["answer"], maintain_order=True).agg(pl.apply(f=custom_std, exprs=["value"]))
gdf
┌────────┬─────────────────────────────────────┐
│ answer ┆ value                               │
│ ---    ┆ ---                                 │
│ str    ┆ list [f64]                          │
╞════════╪═════════════════════════════════════╡
│ yes    ┆ [1.527525, 1.527525, ... 2.51661... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ maybe  ┆ [0.707107, 2.828427, 2.12132]       │
└────────┴─────────────────────────────────────┘

To get it in the desired format explode the resulting DataFrame:

gdf.explode("value")
┌────────┬──────────┐
│ answer ┆ value    │
│ ---    ┆ ---      │
│ str    ┆ f64      │
╞════════╪══════════╡
│ yes    ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes    ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes    ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes    ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe  ┆ 0.707107 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe  ┆ 2.828427 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe  ┆ 2.12132  │
└────────┴──────────┘
Saaru Lindestøkke