
I am new to polars and I wonder what the equivalent of pandas' groupby.apply(drop_duplicates) is in polars. Here is the code snippet I need to translate:

import pandas as pd

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pd.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').apply(lambda x: x.drop_duplicates())

(screenshots of the input data and the desired output)

I tried the following, but it does not output what I need:

import polars as pl

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pl.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').agg(pl.all().unique())

If I take only one GROUP, I get what I want locally:

df_part = df_input.filter(pl.col('GROUP')=='2')
df_part[['OPERATION', 'BATCH']].unique()

Does somebody know how to do that?

Upsikan

3 Answers


It looks like you want the first instance of each OPERATION, BATCH "pairing" per GROUP.

You can use pl.struct to create the "pairing", and then .is_first().over("GROUP"):

(df.with_row_count()
   .filter(pl.struct("OPERATION", "BATCH").is_first().over("GROUP")))
shape: (9, 4)
┌────────┬───────┬───────────┬───────┐
│ row_nr ┆ GROUP ┆ OPERATION ┆ BATCH │
│ ---    ┆ ---   ┆ ---       ┆ ---   │
│ u32    ┆ str   ┆ str       ┆ str   │
╞════════╪═══════╪═══════════╪═══════╡
│ 0      ┆ 1     ┆ A         ┆ 7     │
│ 1      ┆ 2     ┆ A         ┆ 7     │
│ 2      ┆ 3     ┆ A         ┆ 7     │
│ 3      ┆ 2     ┆ B         ┆ 8     │
│ 4      ┆ 3     ┆ B         ┆ 9     │
│ 5      ┆ 1     ┆ A         ┆ 8     │
│ 7      ┆ 3     ┆ A         ┆ 9     │
│ 10     ┆ 2     ┆ A         ┆ 8     │
│ 11     ┆ 1     ┆ B         ┆ 9     │
└────────┴───────┴───────────┴───────┘

The .with_row_count is just added to show which rows were kept and which were removed.

jqurious
  • Hi, I just needed to add some brackets to make it work: (df.with_row_count() .filter(pl.struct(["OPERATION", "BATCH"]).is_first().over("GROUP"))) Speed increased by a factor of ten compared to my answer, thank you for your answer! – Upsikan May 08 '23 at 14:02
  • @Upsikan Are there other columns in your real data? Because your example can be solved with just: `df.unique()` - this answer is assuming you want to perform the unique/duplicate check on a subset of columns. – jqurious May 08 '23 at 14:12
  • What I initially intended to do can indeed be solved by df.unique(). However your main answer gives the opportunity to keep additional variables (such as an ID in my case), which could be beneficial, for double checking. When it comes to performance df.unique is 10 times faster, so I'll use that solution. In any case I learnt how to use struct, thank you a lot ! – Upsikan May 08 '23 at 14:36

The idiomatic way to do this is to use unique(subset=[...]), optionally with maintain_order=True and keep='first'.

For instance,

df_input.unique(['OPERATION', 'BATCH', 'GROUP'], maintain_order=True, keep='first')

shape: (9, 3)
┌───────┬───────────┬───────┐
│ GROUP ┆ OPERATION ┆ BATCH │
│ ---   ┆ ---       ┆ ---   │
│ str   ┆ str       ┆ str   │
╞═══════╪═══════════╪═══════╡
│ 1     ┆ A         ┆ 7     │
│ 2     ┆ A         ┆ 7     │
│ 3     ┆ A         ┆ 7     │
│ 2     ┆ B         ┆ 8     │
│ 3     ┆ B         ┆ 9     │
│ 1     ┆ A         ┆ 8     │
│ 3     ┆ A         ┆ 9     │
│ 2     ┆ A         ┆ 8     │
│ 1     ┆ B         ┆ 9     │
└───────┴───────────┴───────┘
Dean MacGregor

I figured out how to do it:

import polars as pl

GROUP = list('123231232121212321')
OPERATION = list('AAABBABAAAABBABBBA')
BATCH = list('777898897889878987')

df_input = pl.DataFrame({'GROUP':GROUP, 'OPERATION':OPERATION, 'BATCH':BATCH})
df_output = df_input.groupby('GROUP').apply(lambda x: x.unique())
df_output.sort(by ='GROUP')

Please let me know if you find another way to do it! I guess it could be more efficient.

Upsikan