I need to work with data from two pandas dataframes, but I'm having trouble figuring out how to remove rows from them efficiently. The df_books dataframe contains roughly 300k entries with book details (isbn, title, and author), while the df_ratings dataframe contains 1.1 million entries with user rating details (user, isbn, rating).
Format of the data to be cleaned:
import pandas as pd

# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})
df_ratings = pd.read_csv(
    ratings_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

# add a 'keep' flag column to both dataframes
df_books = df_books.assign(keep='yes')
df_ratings = df_ratings.assign(keep='yes')
The project specifies that:
If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with less than 200 ratings and books with less than 100 ratings.
So, to decide which rows to remove, I group the df_ratings dataframe by isbn and by user and select the groups that fall below the specified thresholds:
df_groupby_isbn = df_ratings.groupby('isbn').count()[lambda x: x['rating']<100]
df_groupby_user = df_ratings.groupby('user').count()[lambda x: x['rating']<200]
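To make the shape of these two results concrete, here is a minimal sketch on a toy dataframe. The toy data and the thresholds of 2 (the hypothetical names BOOK_THRESHOLD and USER_THRESHOLD) are stand-ins for the real 100 and 200, chosen only so the small example has something to show. The point is that the isbn and user values end up in the index of each result, not in a column.

```python
import pandas as pd

# Toy stand-in for df_ratings (assumption: the real data has the same columns)
df_ratings = pd.DataFrame({
    'user': [1, 1, 1, 2, 2, 3],
    'isbn': ['a', 'b', 'c', 'a', 'b', 'a'],
    'rating': [7.0, 5.0, 3.0, 8.0, 6.0, 9.0],
})

BOOK_THRESHOLD = 2  # hypothetical stand-in for 100
USER_THRESHOLD = 2  # hypothetical stand-in for 200

# Same expressions as above, with the smaller thresholds
df_groupby_isbn = df_ratings.groupby('isbn').count()[lambda x: x['rating'] < BOOK_THRESHOLD]
df_groupby_user = df_ratings.groupby('user').count()[lambda x: x['rating'] < USER_THRESHOLD]

# The group keys live in the index of each result
low_count_isbns = df_groupby_isbn.index.tolist()
low_count_users = df_groupby_user.index.tolist()
print(low_count_isbns)
print(low_count_users)
```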
My issue is that I can't figure out how to efficiently drop rows from the df_ratings dataframe based on the conditions above.
I tried calling the .drop() method directly with df_groupby_isbn, but it ended up doing nothing:
df_ratings.drop(df_groupby_isbn,axis=1)
print(len(df_ratings))
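Two general properties of DataFrame.drop seem relevant to why the length does not change: it returns a new dataframe rather than modifying the original (unless the result is reassigned), and axis=1 refers to columns, while rows are dropped with axis=0 using index labels. The following is a small sketch of that behaviour on a toy dataframe (the name df and its contents are made up for illustration):

```python
import pandas as pd

# Toy dataframe (hypothetical data, same column names as df_ratings)
df = pd.DataFrame({
    'user': [1, 2, 3],
    'isbn': ['a', 'b', 'c'],
    'rating': [7.0, 5.0, 3.0],
})

# drop() returns a new dataframe; df itself is untouched unless reassigned
without_rating = df.drop(['rating'], axis=1)
print(len(df), list(df.columns))     # df still has all of its rows and columns
print(list(without_rating.columns))  # only the returned copy lost 'rating'

# axis=1 removes columns, so the row count never changes with it.
# Rows are removed with axis=0, using labels from the index.
rows_dropped = df.drop([0, 2], axis=0)
print(len(rows_dropped))
```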
I also looked into vectorized methods but couldn't figure them out. I then tried a for-loop that checked each row against the conditions, but it was so slow that the process never finished.
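For illustration, a vectorized filter of the kind described might look like the sketch below. It is not a verified solution for the full dataset. It assumes that both counts are taken from the unfiltered ratings (the project text does not say whether the two conditions should be applied one after the other), and it uses toy data with hypothetical thresholds MIN_USER_RATINGS and MIN_BOOK_RATINGS of 2 in place of 200 and 100.

```python
import pandas as pd

# Toy stand-in for df_ratings (assumption: same columns as the real data)
df_ratings = pd.DataFrame({
    'user': [1, 1, 1, 2, 2, 3],
    'isbn': ['a', 'b', 'c', 'a', 'b', 'a'],
    'rating': [7.0, 5.0, 3.0, 8.0, 6.0, 9.0],
})

MIN_USER_RATINGS = 2  # hypothetical stand-in for 200
MIN_BOOK_RATINGS = 2  # hypothetical stand-in for 100

# transform('count') returns one value per original row:
# the number of ratings in that row's user (or isbn) group
user_counts = df_ratings.groupby('user')['rating'].transform('count')
book_counts = df_ratings.groupby('isbn')['rating'].transform('count')

# Keep only rows whose user AND book both meet the minimum counts
keep_mask = (user_counts >= MIN_USER_RATINGS) & (book_counts >= MIN_BOOK_RATINGS)
df_filtered = df_ratings[keep_mask]
print(df_filtered)
```

Because the mask is built from whole-column operations, no Python-level loop over the rows is involved.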
Example dataframe:
import pandas as pd
df_ratings = pd.DataFrame({
    'user': [276725, 276726, 276727],
    'isbn': ['034545104X', '0155061224', '0446520802'],
    'rating': [7.0, 5.0, 3.0],
    'keep': ['yes', 'yes', 'yes']})
How can I go through the dataframe and remove each row whose isbn or user value falls under either of the conditions above?