I have a dataset of user entries that records how many interactions each user has had on my website.

It has 340k rows and 70+ columns. I want to use Vaex, but I'm having trouble doing simple things such as dropping duplicates.

Could someone show me how to do it?

import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice', 'Alice', 'Alice', "Ralph", "Ralph"], 
                   'date': ['2013-12-05', '2014-02-05', '2013-11-07', '2014-04-22', '2014-04-30',  '2014-04-20', '2014-05-29'],
                   'interaction_num': ['1', '2', '1', '2', '3', '1','2']})

I want the same result as pandas' drop_duplicates(keep="last"):

df.drop_duplicates('user', keep='last', inplace=True)

The expected result using Vaex should be:

   user   date        interaction_num
1  Bob    2014-02-05  2
4  Alice  2014-04-30  3
6  Ralph  2014-05-29  2
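
To make the target behaviour concrete, here is a minimal sketch that reproduces the keep="last" result using only a groupby and a max aggregation, which is the kind of operation a groupby-oriented library can usually express. It is written in pandas so it can be checked against drop_duplicates directly; the helper column name `row` is an assumption made for illustration, and the translation to Vaex is not tested here.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['Bob', 'Bob', 'Alice', 'Alice', 'Alice', 'Ralph', 'Ralph'],
    'date': ['2013-12-05', '2014-02-05', '2013-11-07', '2014-04-22',
             '2014-04-30', '2014-04-20', '2014-05-29'],
    'interaction_num': ['1', '2', '1', '2', '3', '1', '2'],
})

# Reference result from pandas itself.
expected = df.drop_duplicates('user', keep='last')

# Tag every row with its position ('row' is a hypothetical helper column),
# then find the highest position seen for each user.
last_positions = df.assign(row=range(len(df))).groupby('user')['row'].max()

# Select those rows, restoring the original row order.
result = df.iloc[sorted(last_positions)]

print(result)
```

Both `result` and `expected` keep rows 1, 4 and 6. The same idea (a row counter plus a per-key maximum) relies on the rows already being in the order that defines "last".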
Asclepius
Leonardo Ferreira
  • Does this answer your question? [Drop duplicate rows in python vaex](https://stackoverflow.com/questions/62937249/drop-duplicate-rows-in-python-vaex) – Superdooperhero Dec 11 '21 at 07:32

2 Answers

Duplicate question

It seems Vaex has no built-in equivalent of drop_duplicates yet, but we should expect this functionality at some point.

In the meantime, there is an attempt from the creator of Vaex.

radupm

The code adapted from https://github.com/vaexio/vaex/pull/1623/files works for me:

import vaex

def drop_duplicates(df, columns=None):
    """Return a :class:`DataFrame` object with no duplicates in the given columns.
    .. warning:: The resulting dataframe will be in memory, use with caution.
    :param columns: Column or list of columns to remove duplicates by, defaults to all columns.
    :return: :class:`DataFrame` object with duplicates filtered away.
    """
    if columns is None:
        columns = df.get_column_names()
    if isinstance(columns, str):
        columns = [columns]
    # Group by the key columns: one row per unique combination. The helper count
    # is dropped afterwards, so the result holds only the columns in `columns`.
    return df.groupby(columns, agg={'__hidden_count': vaex.agg.count()}).drop('__hidden_count')
Vicky Ruiz