I am working with vaex in Python and I don't know how to drop duplicate rows from a dataframe. pandas, for example, has the `drop_duplicates()` method. Is there a similar function in vaex?

Asclepius
rootware

2 Answers

It seems there is none yet, but we should expect this functionality at some point.

In the meantime, there is an attempt from the creator of vaex.

radupm

I went with this groupby approach:

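import vaex

# Sample data: only the row (3, 'c', 0) appears twice.
df = vaex.from_arrays(x=[1, 2, 3, 4, 1, 2, 3, 4],
                      s=['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D'],
                      q=[0, 0, 0, 0, 0, 1, 0, 0])
# Helper numeric column so the group-by has something to aggregate.
df['new'] = df.x
# Group by every original column, then keep only those columns.
dfg = df.groupby(['x', 's', 'q']).agg({'new': "sum"})['x', 's', 'q']
dfg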

So basically you add some sort of numeric column, group by all of the original columns, and sum the new column. Dropping that sum afterwards leaves one row per unique (grouped) combination of the original columns.
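To see why grouping by every column removes duplicates, here is a minimal sketch in plain Python rather than vaex. It uses the same sample values; the helper name `unique_rows` is made up for illustration and is not part of any library. Each distinct row becomes one dictionary key, which is roughly what a group-by over all columns produces.

```python
# Plain-Python illustration (not vaex): grouping rows by all of their
# column values leaves exactly one group per distinct row.
x = [1, 2, 3, 4, 1, 2, 3, 4]
s = ['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D']
q = [0, 0, 0, 0, 0, 1, 0, 0]


def unique_rows(*columns):
    """Map each distinct row (as a tuple) to how often it occurs.

    Hypothetical helper for illustration only. Keys come out in
    first-seen order because dicts preserve insertion order.
    """
    counts = {}
    for row in zip(*columns):
        counts[row] = counts.get(row, 0) + 1
    return counts


groups = unique_rows(x, s, q)
deduplicated = list(groups)  # one tuple per unique row

print(len(deduplicated))      # 7 unique rows out of 8
print(groups[(3, 'c', 0)])    # 2, the only duplicated row
```

Note that the row order of a vaex group-by result is not guaranteed to match this first-seen order.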

Superdooperhero
  • This works, but keep in mind that the output is held in memory. If your group-by output is too big to fit in RAM, this approach will not work. – Joco Dec 18 '21 at 22:41
  • Surely `vaex` works out of core, so being too big to fit in RAM is not an issue? – Superdooperhero Dec 20 '21 at 05:50
  • It does, and the groupby aggregation is out of core as well, but the resulting dataframe is in memory. So just be careful when doing groupbys with lots of columns. – Joco Dec 20 '21 at 09:12