Drop duplicate rows in python vaex

Question

I am working with vaex in Python, and I don't know how to drop duplicate rows from a dataframe. pandas, for example, has the drop_duplicates() method. Is there a similar function in vaex?

score 2 · Answer 1 · answered Feb 27 '21 at 18:48

It seems there is none yet, but we should expect this functionality at some point.

In the meantime, there is an attempt from the creator of vaex

radupm

score 1 · Answer 2 · answered Dec 10 '21 at 16:07

I went with this groupby approach:

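import vaex

# Sample data: the row (3, 'c', 0) appears twice.
df = vaex.from_arrays(x=[1, 2, 3, 4, 1, 2, 3, 4],
                      s=['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D'],
                      q=[0, 0, 0, 0, 0, 1, 0, 0])
# Helper numeric column, only there so the groupby has something to aggregate.
df['new'] = df.x
# Group over all original columns, then keep only those columns.
dfg = df.groupby(['x', 's', 'q']).agg({'new': "sum"})['x', 's', 'q']
dfg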

So basically you add a numeric helper column, group over all the original columns, and sum the helper column. Then you drop the new sum, which leaves one row per unique combination of the original columns.
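To see why grouping over every column removes duplicates, here is a minimal plain-Python sketch of the same idea on the same sample data. It uses only the standard library and is an illustration of the logic, not how vaex implements groupby; the names rows, groups, unique_rows and duplicated are made up for this example.

```python
from collections import Counter

# Same sample data as above, written as one tuple per row: (x, s, q).
x = [1, 2, 3, 4, 1, 2, 3, 4]
s = ['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D']
q = [0, 0, 0, 0, 0, 1, 0, 0]
rows = list(zip(x, s, q))

# Grouping by every column means each distinct row becomes one group key.
# The count plays the role of the throwaway aggregate ('new' summed above).
groups = Counter(rows)

# Discarding the aggregate leaves exactly one row per group.
unique_rows = list(groups)

# Rows that occurred more than once in the input.
duplicated = [row for row, n in groups.items() if n > 1]

print(len(rows), "->", len(unique_rows))  # 8 -> 7
print(duplicated)                         # [(3, 'c', 0)]
```

Only (3, 'c', 0) is an exact duplicate here, since 'a' and 'A' (and 'd' and 'D') are different strings and the two (2, 'b') rows differ in q. The row order of a vaex groupby result may differ from the first-seen order this sketch produces.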

Superdooperhero

This works, but keep in mind that the output is in memory. If your group-by output is too big to fit in RAM, this approach will not work. – Joco Dec 18 '21 at 22:41
Surely `vaex` does out of core, so too big to fit in ram is not an issue? – Superdooperhero Dec 20 '21 at 05:50
It does, and the groupby aggregation is out of core as well, but the resulting dataframe is in memory. So just be careful when doing groupbys with lots of columns – Joco Dec 20 '21 at 09:12
