How to drop a series of rows from dataframe in a faster way

Question

I have a data set and I want to drop some rows with a faster method. I tried the following code, but it took a long time.

I want to drop every user who makes fewer than 3 operations.

Every operation is stored in its own row, so user_id is not a unique identifier for the rows of my data.

undesirable_users=[] 
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i])<=3:
        undesirable_users.append(operations_per_user.index[i])

for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)

data is a DataFrame and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts().

Can you show what `operations_per_user` and `data` look like? Might make it easier to suggest a solution — razdi, May 02 '19 at 23:12
Regarding my data: it is a DataFrame, and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts() — Nirmine, May 03 '19 at 00:04

Dagorodir · Answer 1 · 2019-05-03T00:28:47.833

If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:

data = data.drop(data.loc[data['operations_per_user'] <= 3].index)

Edit

Instead of creating a separate Series, you could add operations_per_user to data as a column. The counts returned by value_counts() are indexed by user_id rather than by row, so map them back onto each row:

data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())

You could either perform the drop as above or perform the selection with the inverse logical condition:

data = data.loc[data['operations_per_user'] > 3]
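
As a rough illustration of the column-based approach, here is a minimal sketch on a toy frame. The `amount` column and the sample values are hypothetical, not taken from the question. It uses `groupby(...).transform("size")`, which returns one count per row aligned with the index of `data`, so the comparison produces a row-wise boolean mask.

```python
import pandas as pd

# Toy data (hypothetical): user 1 has 4 operations, user 2 has 2, user 3 has 5.
data = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "amount": range(11),
})

# One count per row, aligned with the index of data.
data["operations_per_user"] = data.groupby("user_id")["user_id"].transform("size")

# Keep only the rows whose user has more than 3 operations.
kept = data.loc[data["operations_per_user"] > 3]
print(sorted(kept["user_id"].unique()))
```

Selecting the rows to keep avoids building an index of rows to drop, which is usually the simpler of the two options.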

Original

It would be preferable if you could supply some more information about the variables used in your code.

If operations_per_user is a pandas Series, your first loop could be improved with:

undesirable_users=[] 
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)

The function get_value() is deprecated; use loc or iloc instead (or at and iat when you only need a single value). This is a good summary of loc and iloc, and here is a great pandas cheatsheet to reference.

You can iterate over a Python list directly; for your second loop:
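
For reference, a small sketch of label-based scalar access without `get_value()`, assuming `operations_per_user` comes from `value_counts()` on a toy Series (the sample values are hypothetical). To my knowledge `.at` is the accessor intended for single-value lookups. The last line builds the same list with boolean indexing and no Python loop.

```python
import pandas as pd

# Toy user ids (hypothetical): user 1 appears 4 times, user 2 twice, user 3 five times.
user_ids = pd.Series([1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
operations_per_user = user_ids.value_counts()

# Loop version, using .at for scalar access by label.
undesirable_users = []
for user in operations_per_user.index:
    if operations_per_user.at[user] <= 3:
        undesirable_users.append(user)

# Same result without a loop: index the labels with a boolean mask.
vectorized = operations_per_user.index[operations_per_user <= 3].tolist()
```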

for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)

Thanks! Regarding get_value(): in my experience it is usually much faster than iloc and loc. Moreover, I used your code, but it executes slowly. I need a faster method. Thank you again — Nirmine, May 03 '19 at 00:02
Regarding my data: it is a DataFrame, and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts() — Nirmine, May 03 '19 at 00:03
@Nirmine - See the edit, you can just add the values from the `operation_per_user` series to `data` as a column, then perform the drop/logical selection in one line. — Dagorodir, May 03 '19 at 00:14

razdi · Accepted Answer · 2019-05-03T01:13:23.657

Why not just filter them? You don't need to loop at all.

You can get the filtered indexes by:

operations_per_user.index[operations_per_user <= 3]

And then you can filter these indexes from the df, making the solution:

data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]

EDIT

My understanding is that you want to remove any user that occurs fewer than 3 times in the data. You don't need to create a value_counts Series for that; you could do a groupby, find the counts, and filter on that basis.

filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()

data = data[~data['user_id'].isin(filtered_user_ids)]
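
A self-contained sketch of both routes, run on a toy frame (the `amount` column and sample values are hypothetical). The first route combines `value_counts()` with `isin()`. The second asks `groupby(...).filter()` for the rows to keep directly, which skips the intermediate list of ids.

```python
import pandas as pd

# Toy data (hypothetical): user 1 has 4 operations, user 2 has 2, user 3 has 5.
data = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "amount": range(11),
})

# Route 1: count once, then drop rows whose user_id is in the undesirable set.
operations_per_user = data["user_id"].value_counts()
undesirable_users = operations_per_user.index[operations_per_user <= 3]
via_isin = data[~data["user_id"].isin(undesirable_users)]

# Route 2: let groupby.filter return the kept rows directly.
via_filter = data.groupby("user_id").filter(lambda g: len(g) > 3)

print(via_isin["user_id"].unique(), via_filter["user_id"].unique())
```

With many distinct users, the `isin()` route is likely the faster of the two, since `filter()` calls the Python lambda once per group.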

edited May 03 '19 at 01:13

answered May 02 '19 at 23:19


Thanks! But I get this error: TypeError: 'Series' objects are mutable, thus they cannot be hashed – Nirmine May 02 '19 at 23:56
Thanks a lot, but I get the following error when executing the second instruction (data=data[.....]): ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – Nirmine May 03 '19 at 01:03
That was a quick fix. Have a look – razdi May 03 '19 at 01:13
Thank you, razdi. I do appreciate your help. It works now. – Nirmine May 03 '19 at 01:17

Valentino · Answer 3 · 2019-05-03T00:54:14.217

Rather than dropping rows, you can simply select the rows you want to keep by inverting the logical condition.

First, select only the users to keep.
Then build a boolean list with one entry per row of data.
Finally, select the rows to keep.

keepusers = operations_per_user.loc[operations_per_user > 3]
tokeep = [uid in keepusers.index for uid in data['user_id']]
newdata = data.loc[tokeep]
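
The list comprehension still visits every row in Python. As a possible alternative (a sketch on hypothetical toy data, not part of the original answer), `Series.map` can look up each row's count in `operations_per_user`, giving the same boolean mask in one vectorized step.

```python
import pandas as pd

# Toy data (hypothetical): user 1 has 4 operations, user 2 has 2, user 3 has 5.
data = pd.DataFrame({"user_id": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]})
operations_per_user = data["user_id"].value_counts()

# Look up each row's user count, then compare: one boolean per row of data.
tokeep = data["user_id"].map(operations_per_user) > 3

# Select the rows to keep.
newdata = data.loc[tokeep]
```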

edited May 03 '19 at 00:54

answered May 03 '19 at 00:03

