4

In this example, on columns ["foo", "ham"], I want rows 1 and 4 to be removed since they match a pair in the list

df = pl.DataFrame(
    {
        "foo": [1, 1, 2, 2, 3, 3, 4],
        "bar": [6, 7, 8, 9, 10, 11, 12],
        "ham": ["a", "b", "c", "d", "e", "f", "b"]
    }
)
pairs = [(1,"b"),(3,"e"),(4,"g")]

The following worked for me but I think this will be problematic when the dataframe and list of pairs are large.

for a, b in pairs:
    df = df.filter(~(pl.col('foo') == a) | ~(pl.col('ham') == b))

I think this is the pandas implementation for this problem Pandas: How to remove rows from a dataframe based on a list of tuples representing values in TWO columns?

I am not sure what the Polars implementation of it is.

(I think this problem can be generalized to any number of selected columns and any number of elements in a group. For instance, rather than a list of pairs, it can be another dataframe. You get the 'set difference', in terms of rows, of the two dataframes based on specific columns.)

pikaft
  • 43
  • 5

1 Answers1

8

It looks like an anti join

schema = ["foo", "ham"]

df.with_row_count().join(
   pl.DataFrame(pairs, schema=schema), 
   on=schema, 
   how="anti"
)
shape: (5, 4)
┌────────┬─────┬─────┬─────┐
│ row_nr ┆ foo ┆ bar ┆ ham │
│ ---    ┆ --- ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 ┆ str │
╞════════╪═════╪═════╪═════╡
│ 0      ┆ 1   ┆ 6   ┆ a   │
│ 2      ┆ 2   ┆ 8   ┆ c   │
│ 3      ┆ 2   ┆ 9   ┆ d   │
│ 5      ┆ 3   ┆ 11  ┆ f   │
│ 6      ┆ 4   ┆ 12  ┆ b   │
└────────┴─────┴─────┴─────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • Nice! Just wondering ... Is there any syntax to accomplish the same thing using `df.filter(...`, regardless of how contorted it may be? – rickhg12hs Feb 25 '23 at 21:46
  • 1
    It should be possible. I've added an example which I think is equivalent. – jqurious Feb 25 '23 at 22:08
  • Any way to make it column name _"agnostic"_? For example, if we already have `pairs = pl.DataFrame(pairs, schema=schema)`, i.e., `pairs` is a `pl.DataFrame` with possibly fewer columns than `df`, is there a `df.filter(...` where there is no need to specify column names (because they're already part of `pairs`)? – rickhg12hs Feb 25 '23 at 22:36
  • 1
    If you have 2 dataframes, I think you'd need to join them first anyways. – jqurious Feb 25 '23 at 22:55
  • @rickhg12hs you could do it as a filter with a generator or list comprehension. Just make another question to get a complete answer as it doesn't really fit in a comment. – Dean MacGregor Feb 26 '23 at 13:26
  • @DeanMacGregor I'm a `polars` noob and just saw this question/answer as an opportunity to learn. The `anti` `join` satisfies my current curiosity. :-) – rickhg12hs Feb 26 '23 at 13:29
  • 1
    Although if your `pairs` is already a df then it makes no sense since you'd be converting it to a list of rows in the process. For instance you could do `df.filter(~((pl.col('foo') == tup[0]) & (pl.col('ham') == tup[1])) for tup in pairs)` btw, on mobile so I didn't test that. – Dean MacGregor Feb 26 '23 at 13:31
  • 1
    If `pairs` were a df then you could hack something together with pairs.columns and pairs.rows but it'd be a definite anti pattern. – Dean MacGregor Feb 26 '23 at 13:39