1

I have a data set and I want to drop some rows using a faster method. I tried the following code, but it takes a long time.

I want to drop every user who makes fewer than 3 operations.

Every operation is stored in its own row, and user_id is not the index of my data.

undesirable_users=[] 
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i])<=3:
        undesirable_users.append(operations_per_user.index[i])

for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)

data is a dataframe and operation_per_user is a series created by: operation_per_user = data['user_id'].value_counts().

Valentino
Nirmine
  • 1
    Can you show what `operations_per_user` and `data` look like? Might make it easier to suggest a solution – razdi May 02 '19 at 23:12
  • concerning my data it is a dataframe and operation per user is a series that is created by : operation_per_user = data['user_id'].value_counts() – Nirmine May 03 '19 at 00:04

3 Answers

0
  • If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)

Edit

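Instead of creating a separate Series, you could add operations_per_user to data as a column. Assigning value_counts() directly would align on the user_id labels rather than on the rows of data, so map the counts back through user_id:

data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())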

You could either perform the drop as above or perform the selection with the inverse logical condition:

data = data.loc[data['operations_per_user'] > 3]
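As a minimal sketch of the column-based approach, the snippet below builds a per-row count with groupby and transform('count'), which returns a result aligned with the rows of data, and then keeps users with more than 3 operations. The sample frame and the amount column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row per operation.
data = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "b", "b", "c"],
    "amount": [1, 2, 3, 4, 5, 6, 7],
})

# Per-row operation count, aligned with data's own index.
data["operations_per_user"] = (
    data.groupby("user_id")["user_id"].transform("count")
)

# Keep only the rows of users with more than 3 operations.
kept = data.loc[data["operations_per_user"] > 3]
```

Because the whole selection is a single boolean mask, no Python-level loop over users is needed.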

Original

It would be preferable if you could supply some more information about the variables used in your code.

  • If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users=[] 
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)

The function get_value() is deprecated; use at or iat for scalar access (or loc or iloc) instead. This is a good summary of loc and iloc, and here is a great pandas cheatsheet to reference.

  • You can iterate over Python lists directly; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
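For comparison, here is a rough sketch of the same two steps without Python-level loops, assuming operations_per_user comes from value_counts() as described in the question. It uses .at for a scalar lookup by label, a boolean mask on the Series to collect the low-count users, and a single drop call. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: one row per operation.
data = pd.DataFrame({"user_id": ["a", "a", "a", "a", "b", "b", "c"]})
operations_per_user = data["user_id"].value_counts()

# Scalar lookup by label with .at (a replacement for get_value()).
count_for_a = operations_per_user.at["a"]

# A boolean mask on the Series selects every low-count user at once.
undesirable_users = operations_per_user.index[operations_per_user <= 3]

# Collect the index labels of all rows to remove and drop them in one call.
rows_to_drop = data.index[data["user_id"].isin(undesirable_users)]
data = data.drop(rows_to_drop)
```

Calling drop once avoids rebuilding the DataFrame for every undesirable user, which is likely where most of the time goes in the looped version.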
Dagorodir
  • thanks! Concerning get_value(), it usually works much faster for me than iloc and loc. Moreover, I used your code but it executes slowly. I need a faster method. Thank you again – Nirmine May 03 '19 at 00:02
  • concerning my data it is a dataframe and operation per user is a series that is created by : operation_per_user = data['user_id'].value_counts() – Nirmine May 03 '19 at 00:03
  • @Nirmine - See the edit, you can just add the values from the `operation_per_user` series to `data` as a column, then perform the drop/logical selection in one line. – Dagorodir May 03 '19 at 00:14
0

Why not just filter them? You don't need to loop at all.

You can get the filtered indexes by:

operations_per_user.index[operations_per_user <= 3]

And then you can filter these indexes from the df, making the solution:

data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]

EDIT

My understanding is that you want to remove any user that occurs fewer than 3 times in the data. You don't need to create a value_counts Series for that; you can group by user_id, check the size of each group, and filter on that basis.

filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()

data = data[~data['user_id'].isin(filtered_user_ids)]
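A possible shortcut, sketched here on made-up data, is to invert the condition inside filter so that it returns the rows to keep directly, which skips the intermediate list of user ids.

```python
import pandas as pd

# Hypothetical sample data with the users interleaved.
data = pd.DataFrame({
    "user_id": ["a", "b", "a", "c", "a", "b", "a"],
    "amount": [1, 2, 3, 4, 5, 6, 7],
})

# filter() keeps every row whose group satisfies the condition,
# here: users with more than 3 operations.
kept = data.groupby("user_id").filter(lambda g: len(g) > 3)
```

Note that the lambda still runs once per user, so on data with very many distinct users a value_counts plus isin mask may be faster; that is worth timing on the real data.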
razdi
  • thanks! but I get this error: TypeError: 'Series' objects are mutable, thus they cannot be hashed – Nirmine May 02 '19 at 23:56
  • thanks a lot, but I get the following error when executing the second instruction (data=data[.....]): ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – Nirmine May 03 '19 at 01:03
  • That was a quick fix. Have a look – razdi May 03 '19 at 01:13
  • Thank you razdi I do appreciate your help It works now – Nirmine May 03 '19 at 01:17
0

Rather than dropping, you can simply select the rows you want to keep by inverting the logical condition.

First, select only the users to keep.
Then build a boolean list with one entry per row of data.
Finally, select the rows to keep.

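# Users with more than 3 operations (the index holds the user_id values)
keepusers = operation_per_user.loc[operation_per_user > 3]
# One boolean per row of data: True if the row's user is kept
tokeep = [uid in keepusers.index for uid in data['user_id']]
newdata = data.loc[tokeep]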
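The boolean list can also be built without a Python-level comprehension. The sketch below, on hypothetical sample data, uses isin against the index of the kept users, which yields a boolean Series with one entry per row of data.

```python
import pandas as pd

# Hypothetical sample data: one row per operation.
data = pd.DataFrame({"user_id": ["a", "b", "a", "c", "a", "b", "a"]})
operation_per_user = data["user_id"].value_counts()

# Users with more than 3 operations; the index holds the user_id values.
keepusers = operation_per_user.loc[operation_per_user > 3]

# Vectorized membership test: True where the row's user is kept.
tokeep = data["user_id"].isin(keepusers.index)
newdata = data.loc[tokeep]
```

The result should match the list-comprehension version while avoiding a per-row lookup in Python.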
Valentino