For each row, I'm trying to compute the standard deviation of the other values in its group, excluding the row's own value. A way to think about it is "what would the standard deviation of the group be if this row's value were removed?" An example may be easier to parse:
import polars as pl

df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
I would want to add a column where the first row would be std([10, 7, 8]) ≈ 1.527525, i.e. the sample standard deviation of the other "yes" values.
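For reference, that number is just the plain sample standard deviation of the remaining values, so it can be double-checked with the standard library:

from statistics import stdev

# the other "yes" values once the first row's 5 is removed
print(stdev([10, 7, 8]))  # ≈ 1.527525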
I tried to hack something together and ended up with code that is horrible to read and also has a bug that I don't know how to work around:
# the first with_column computes the leave-one-out mean ("average_other"),
# the second tries to turn it into the leave-one-out standard deviation
df.with_column(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_column(
    (
        (
            (
                (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
                - (pl.col("value") - pl.col("average_other")).pow(2)
            )
            / (pl.col("value").count().over(pl.col("answer")) - 1)
        ).sqrt()
    ).alias("std_dev_other")
)
I'm not sure I would recommend parsing that, but I'll point out at least one thing that is wrong:
pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
Within the window, I want to compare each row's "value" to the current row's "average_other", then square and sum those differences over the window. Instead, each row's "value" is being compared to that same row's own "average_other" before the sum is taken.
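For comparison, here is a sketch of one possible reformulation (not necessarily the best way, which is what I'm asking about). It sidesteps the row-versus-window comparison by expanding the squared deviations algebraically, so everything reduces to per-group window sums: the sum of squared deviations of the other values is (Q - x^2) - (S - x)^2 / (n - 1), where S, Q and n are the group's sum, sum of squares and count. It assumes every group has at least three rows:

# per-group sums, computed once as window expressions
n = pl.col("value").count().over(pl.col("answer")).cast(pl.Float64)
s = pl.col("value").sum().over(pl.col("answer"))
q = pl.col("value").pow(2).sum().over(pl.col("answer"))

df.with_column(
    (
        # (q - x^2): sum of squares of the other values in the group
        # (s - x)^2 / (n - 1): correction for the mean of those other values
        # dividing by (n - 2) gives their sample variance
        ((q - pl.col("value").pow(2)) - (s - pl.col("value")).pow(2) / (n - 1))
        / (n - 2)
    )
    .sqrt()
    .alias("std_dev_other")
)

On the first row of the example this works out to the expected ≈ 1.527525, but I don't know whether it is idiomatic, and the sum-of-squares formulation presumably loses precision for large values, so I'd still like to hear about a better way.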
My main question is the first part: what is the best way to get the standard deviation while leaving out the current row's value? But I would also be interested in whether there is a way to express the comparison that I'm getting wrong above. Third would be any tips on writing this in a way that makes it easy to understand what is going on.