Removing rows from pandas DataFrame efficiently?

Question

I need to use data from two pandas DataFrames, but I'm having trouble figuring out how to remove rows from them efficiently. The df_books DataFrame contains roughly 300k entries with book details (isbn, title, and author), while the df_ratings DataFrame contains 1.1 million entries with user rating details (user, isbn, rating).

Format of the data to be cleaned:

import pandas as pd

# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

# add a 'keep' marker column to both dataframes
df_books = df_books.assign(keep='yes')
df_ratings = df_ratings.assign(keep='yes')

The project specifies that:

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with less than 200 ratings and books with less than 100 ratings.

So, to find the rows to remove, I group df_ratings by isbn and by user and select the groups that fall below the specified thresholds:

df_groupby_isbn = df_ratings.groupby('isbn').count()[lambda x: x['rating']<100]
df_groupby_user = df_ratings.groupby('user').count()[lambda x: x['rating']<200]

My issue is that I can't seem to figure out how to drop rows efficiently from the df_ratings dataframe based on the condition above.

I tried calling the .drop() method directly with df_groupby_isbn, but it did nothing:

df_ratings.drop(df_groupby_isbn,axis=1)
print(len(df_ratings))

I also looked into vectorized methods but couldn't figure them out. I then tried a for-loop that checked each row against the conditions, but it was so slow that it never finished.

Example dataframe:

import pandas as pd
df_ratings = pd.DataFrame({'user':[276725,276726,276727],'isbn':['034545104X','0155061224','0446520802'],'rating':[7.0,5.0,3.0],'keep':['yes','yes','yes']})

How can I loop through the data frame to check (and remove) a row if it contains a matching column value under either of the conditions above?
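For reference, here is a minimal sketch of one vectorized way to express the two conditions without a loop. It uses a boolean mask built from per-group counts. The toy data and the shrunken thresholds (`MIN_USER_RATINGS`, `MIN_BOOK_RATINGS`) are illustrative assumptions, not part of the real dataset. It also assumes both counts are taken on the unfiltered ratings, which the project text does not spell out.

```python
import pandas as pd

# Toy stand-in for df_ratings (hypothetical values)
df_ratings = pd.DataFrame({
    'user':   [1, 1, 1, 2, 2, 3],
    'isbn':   ['A', 'B', 'C', 'A', 'B', 'A'],
    'rating': [7.0, 5.0, 3.0, 8.0, 6.0, 9.0],
})

# Shrunk thresholds so the effect is visible on six rows;
# the project text asks for 200 (users) and 100 (books)
MIN_USER_RATINGS = 2
MIN_BOOK_RATINGS = 2

# transform('count') broadcasts each group's rating count back onto every row
user_counts = df_ratings.groupby('user')['rating'].transform('count')
book_counts = df_ratings.groupby('isbn')['rating'].transform('count')

# Keep only rows whose user AND book both meet their threshold
keep_mask = (user_counts >= MIN_USER_RATINGS) & (book_counts >= MIN_BOOK_RATINGS)
df_filtered = df_ratings[keep_mask].reset_index(drop=True)

print(len(df_ratings), len(df_filtered))
```

Because the mask is computed for all rows at once, this avoids a Python-level loop over 1.1 million rows. If the intent is to re-count after each removal, the two filters would have to be applied one after the other instead.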

Note that `drop` returns a new dataframe and it looks like you don't assign that to anything. If you want to change the original, you need to pass `inplace=True`: `df_ratings.drop(df_groupby_isbn,axis=1, inplace=True)`. — fsimonjetz, Jul 25 '21 at 22:50
It will help if you can provide a minimal data frame that is representative of the data so another user can try a solution before posting. See: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Idr, Jul 25 '21 at 22:59
Not a `machine-learning` or `numpy` question, kindly do not spam irrelevant tags (removed). — desertnaut, Jul 25 '21 at 23:21
When I first read the question I thought of `.value_counts()`. Found this as one example... [Python: Removing Rows on Count condition](https://stackoverflow.com/questions/49735683/python-removing-rows-on-count-condition) — MDR, Jul 25 '21 at 23:26
@fsimonjetz Looks like that isn't working for me either. I checked the length of the dataframe and it didn't change after calling the method. — ktarahb, Jul 26 '21 at 17:57
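A small sketch of why the row count may stay the same even with `inplace=True`. With `axis=1`, `drop` works on column labels, and iterating a DataFrame yields its column names, so the call appears to target columns rather than rows. The ISBNs of interest sit in the index of the grouped result, so the sketch uses `isin` on that index instead. The toy data and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_ratings (hypothetical values)
df_ratings = pd.DataFrame({
    'user':   [1, 1, 2, 3],
    'isbn':   ['A', 'B', 'A', 'A'],
    'rating': [7.0, 5.0, 8.0, 9.0],
})

# Same construction as df_groupby_isbn in the question, with a toy threshold of 2
rare_isbn = df_ratings.groupby('isbn').count()[lambda x: x['rating'] < 2]

# Iterating a DataFrame yields column labels, not row labels
labels_seen = list(rare_isbn)

# Dropping those labels along the column axis removes columns, so len() is unchanged
cols_dropped = df_ratings.drop(columns=labels_seen)
print(len(df_ratings), len(cols_dropped))

# The rarely rated ISBNs are in the index of the grouped frame;
# a boolean mask with isin removes the matching rows
df_kept = df_ratings[~df_ratings['isbn'].isin(rare_isbn.index)]
print(len(df_kept))
```

The same mask pattern would apply to `df_groupby_user` with the `user` column.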
